
Retention Policy like Time Machine [$50] #2084

Closed
gorbiWTF opened this Issue Nov 3, 2016 · 41 comments

@gorbiWTF commented Nov 3, 2016

I have a suggestion but I don't know how complicated this would be to implement: A retention policy like Time Machine.

From Wikipedia:

Time Machine saves hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for everything older than a month until the volume runs out of space. At that point, Time Machine deletes the oldest weekly backup.

Configurable, of course ;)


There is a $50 open bounty on this issue. Add to the bounty at Bountysource.

@jrast commented Nov 3, 2016

I was thinking about this as well. This would be a very nice feature and requires a lot less storage than keeping all backups for a given duration.

This strategy is also known as the "grandfather-father-son" strategy: https://en.wikipedia.org/wiki/Backup_rotation_scheme#Grandfather-father-son

Here you can find an implementation as a shell script for a database backup strategy: http://processmakerblog.com/backup/the-father-the-son-and-the-grandfather-a-basic-backup-strategy/

kenkendk changed the title from "Retention Policy like Time Machine" to "Retention Policy like Time Machine [$10]" on Nov 3, 2016

kenkendk added the bounty label on Nov 3, 2016

@gorbiWTF (Author) commented Nov 3, 2016

The "Tower of Hanoi" method sounds interesting too. Maybe with n targets?

kenkendk changed the title from "Retention Policy like Time Machine [$10]" to "Retention Policy like Time Machine [$25]" on Dec 20, 2016

kenkendk changed the title from "Retention Policy like Time Machine [$25]" to "Retention Policy like Time Machine [$30]" on Jan 9, 2017

@joolsr commented Jan 17, 2017

Yes, I would find this very useful. I have daily backups going back nearly 9 months which I don't need, and daily backups from a few days ago that I do! Kenneth suggests that you can manually delete versions from the command line, using the list command to find out which versions you have.

@ghost commented Jan 25, 2017

@gorbiWTF I guess the "Tower of Hanoi" would be harder to implement with incremental backups, as it greatly limits the number of backups you can have, whereas the "Grandfather-father-son" method gives more opportunities to implement incremental backups.

@gorbiWTF (Author) commented Jan 25, 2017

@extink I thought so too. Too complicated to implement (I guess?) and to understand for most users (including myself).

@kenkendk (Member) commented Jan 25, 2017

One of the ideas I discussed with @renestach was to use a "spill over" strategy, where you can say "I want 4 daily backups, 2 weekly, 3 monthly", and then whenever you have 4 daily backups, the oldest "spills" to become a weekly and so on.

Beautifully illustrated with a mechanical clock:
https://www.youtube.com/watch?v=UHBHCsrqYMw
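
To make the bookkeeping concrete, here is a toy sketch of that spill-over idea (illustration only, not Duplicati code; the class and variable names are made up, and the tier sizes are the ones from the example above):

using System;
using System.Collections.Generic;

class SpillOverSketch
{
    static void Main()
    {
        // 4 daily, 2 weekly, 3 monthly slots, as in the example above.
        var tierSizes = new[] { 4, 2, 3 };
        var tiers = new List<Queue<DateTime>>();
        for (int t = 0; t < tierSizes.Length; t++)
            tiers.Add(new Queue<DateTime>());

        // Simulate 20 daily backups; whenever a tier overflows, its oldest
        // backup "spills" into the next tier, or is dropped after the last tier.
        var start = new DateTime(2017, 1, 1);
        for (int day = 0; day < 20; day++)
        {
            tiers[0].Enqueue(start.AddDays(day));
            for (int t = 0; t < tiers.Count; t++)
            {
                if (tiers[t].Count <= tierSizes[t])
                    continue;
                var oldest = tiers[t].Dequeue();
                if (t + 1 < tiers.Count)
                    tiers[t + 1].Enqueue(oldest);   // spill over to the next tier
                // else: the oldest backup in the last tier would be deleted
            }
        }

        for (int t = 0; t < tiers.Count; t++)
            Console.WriteLine($"Tier {t}: {string.Join(", ", tiers[t])}");
    }
}

Note that a naive rule like this promotes a backup every time the first tier overflows, so with daily backups the "weekly" slots rotate far more often than once a week.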

@agrajaghh (Contributor) commented Jan 25, 2017

I really like the settings in CrashPlan; I guess this is the spill-over strategy?

[screenshot: CrashPlan retention settings]

kenkendk changed the title from "Retention Policy like Time Machine [$30]" to "Retention Policy like Time Machine [$50]" on Mar 1, 2017

@1activegeek commented Apr 14, 2017

I'll add another +1 to this request. This is exactly the grandfather-father-son methodology that most modern backup software employs. I currently work for an organization that sells an enterprise-only, cloud-based backup solution (hence I can't use it at home for free). We use this exact mechanism to limit backups so that you can retain dailies for X period, weeklies after that for X period, and monthlies after that for X period. Alternatively, you can still achieve infinite retention with all 0's, or just straight dailies for 365 days to keep a full year of everything. It fits all sorts of broad use cases.

Versions aren't tracked in the backups, as I saw in some comments on other threads suggesting to mix and match; that's a recipe for disaster, and tracking becomes difficult. I would also make sure to keep the word "version" to its true meaning: whether it relates to actual snapshots (which is what we're talking about with the G-F-S method) vs. file versions akin to the individual file revisions in MS OneDrive or Google Drive. Unless you are running backups at an extremely frequent rate, the likelihood of actually grabbing all file versions is low. If I make 5 edits in one day, unless I had backups run multiple times during the day to catch each revision, they won't be in my backup sets. If they are, well, then you don't need to worry about losing the versions; they'll be in the backup sets. I'd say the chance of needing a "version" of a file that is outside your normal retention policy, but within your "version" policy, is slim to none.

If you need versioning that badly, set up a git repo to track changes on versions. ;)

@mohakshah commented Sep 3, 2017

@kenkendk Wouldn't using a spill-over strategy with a configuration such as 4 daily backups, 2 weekly, 3 monthly result in weekly backups being rotated every 4 days as the daily backups spill over, and monthly backups being rotated every two weeks?

@TekkiWuff (Contributor) commented Sep 4, 2017

Alright, so I want to give implementing this a try. But before writing any code, I'd like to hear a few opinions from other potential users of this feature:

I decided to go with the easier Grandfather-Father-Son method as mentioned above. The idea is to add an advanced option "keep-staggered-versions" which allows the user to specify multiple pairs of time frames and intervals. For example:

  • for 2 days : keep backups in 1 hour interval
  • for 30 days : keep backups in 1 day interval
  • for 90 days : keep backups in 2 days interval
    and so on.

I don't really like the idea of having to specify the number of backups to keep, since that doesn't explicitly say anything about how distributed the backups should be within the time frame. Hence I'd go with the interval approach.

Here are some things maybe worth discussing:
How would you expect the time frames to behave? Should each time frame always count from day 0, or does the 30 days time frame mentioned above start after the 2 days time frame and the 90 days after that, effectively resulting in 2, 32 (2+30), 122 (2+30+90) days as time frames? Coming from Crashplan and their wording in the GUI I'm kinda biased and would choose the first option, meaning the 90 days time frame really ends at 90 days from "now".

Is there any preference for the format of the option parameter? I'd go with a comma-separated list where the values of each pair are separated by colons. For the three time frames above that would be 1d:1h,30d:1d,90d:2d
The format for the time frames and intervals is the same as for --keep-time: s=seconds, m=minutes, h=hours, d=days, ...
An interval of 0 would mean keep everything in that time frame, which I thought might be useful.

Should there be a default configuration that will be used when setting the option to the value "default"? If so, what time frames and intervals should be used as default?

What should happen to backups outside the specified time frames? I'm leaning towards not touching them at all, so the --keep-versions or --keep-time options can be used to ultimately delete the oldest backups.
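
As a rough illustration of the proposed value format (a sketch only, not the final implementation; the parser class name is hypothetical, and months/years are left out here because they need calendar-aware handling):

using System;
using System.Collections.Generic;

static class StaggeredVersionsParser
{
    // Parses a value like "1D:1h,30D:1D,90D:2D" into (time frame, interval) pairs.
    public static List<(TimeSpan Frame, TimeSpan Interval)> Parse(string value)
    {
        var rules = new List<(TimeSpan, TimeSpan)>();
        foreach (var pair in value.Split(','))
        {
            var parts = pair.Split(':');
            if (parts.Length != 2)
                throw new FormatException($"Expected <timeframe>:<interval>, got '{pair}'");
            rules.Add((ParseSpan(parts[0]), ParseSpan(parts[1])));
        }
        return rules;
    }

    static TimeSpan ParseSpan(string s)
    {
        var number = int.Parse(s.Substring(0, s.Length - 1));
        switch (s[s.Length - 1])
        {
            case 's': return TimeSpan.FromSeconds(number);
            case 'm': return TimeSpan.FromMinutes(number);
            case 'h': return TimeSpan.FromHours(number);
            case 'd':
            case 'D': return TimeSpan.FromDays(number);
            case 'W': return TimeSpan.FromDays(number * 7);
            // 'M' (months) and 'Y' (years) omitted in this sketch
            default: throw new FormatException($"Unknown unit in '{s}'");
        }
    }
}

Calling Parse("1d:1h,30d:1d,90d:2d") would then yield the three (frame, interval) rules from the example above.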

@TekkiWuff (Contributor) commented Sep 4, 2017

@mohakshah I'm not sure how such a problem could be avoided in general. I think you'd have to make sure the next interval is a multiple of the former time frame, so for example 1 hour, 4 hours, 24 hours, 7 days would work.
But if you have, for example, 4 hours for the first and 6 for the second interval (which is not a multiple of 4), the following will happen: the first 4-hour backup fits, the second gets discarded since the interval between backups has to be 6 hours, and the third one fits again, effectively resulting in an 8-hour interval compared to the specified 6 hours.

@gorbiWTF (Author) commented Sep 5, 2017

@TekkiWuff
Your approach seems fine to me.
I like the time frames from Apple's Time Machine: hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for everything older than a month. This could be the default setting.

I don't know how other people do it, but the counting of the time frames should work like this (using your example): I'm doing hourly backups starting Monday at 08:00. On Wednesday at 08:00 I have two full days of hourly backups. Now the first backup from Monday 08:00 becomes the first backup for the 1-day interval. On Thursday at 08:00 the backup from Tuesday 08:00 becomes the second daily backup, and so on. The second hourly backup from Monday 09:00 should be deleted on Wednesday 09:00...
I hope I understood everything and expressed myself clearly :)

An interval of 0 would mean keep everything in that time frame, which I thought might be useful.

Absolutely!

@crazy4chrissi commented Sep 5, 2017

How would you expect the time frames to behave? Should each time frame always count from day 0, or does the 30 days time frame mentioned above start after the 2 days time frame and the 90 days after that, effectively resulting in 2, 32 (2+30), 122 (2+30+90) days as time frames? Coming from Crashplan and their wording in the GUI I'm kinda biased and would choose the first option, meaning the 90 days time frame really ends at 90 days from "now".

It should count from 0. It is more intuitive, as in the user's mind, he thinks "I want one backup per hour within 24 hours and one backup per day in the last week" - and with last week he does not mean 24 hours + 1 week. There is another problem with the other approach: Assume the time frame starts from the end of the last one. Then he defines a 24 hours frame with 1 backup per hour and a 144 hour (6*24) frame with 1 backup per day. First problem, he needs to think twice about the 7*24-24=6*24=144. Now, he decides that he needs a backup per hour not only within a 24 hour frame, but within a 48 hour frame, but still 1 backup per day within one week. Now he needs to adjust the first and the second frame: The first one is 48 hours instead of 24, the second one is 7*24-48=5*24=120 hours.

In the web gui, a graphical interface could help to make it intuitive for the user to define the time frames:
[mock-up: draggable time-frame slider for the web GUI]
In this example, the user has defined a red time frame for 24 hours, a blue one for the first week except the first 24 hours, and a green one for the first month, except the first week. The user can add new frames and drag the borders of the time frames to adjust them. I think this interface makes it completely intuitive how the time frames work. The interface could adjust the x axis dynamically, depending on what the user defined so far. Of course this might involve some JS work and should not be the first priority. First make it work, then make it nice. But maybe there already is a nice jQuery Plugin.

@TekkiWuff (Contributor) commented Sep 5, 2017

@gorbiWTF I'm totally with you on this. If you specify a 24-hour time frame with a 1-hour interval, then you get a full 24-hour sliding window from "now", just as you described it. So at 8:00 you'd have 24 backups going back to 8:00 the previous day. Later, at 15:00, you'd have 24 hours of backups going back to 15:00 the previous day, and so on.
There are of course some special cases. Assume you haven't turned on your PC for a while. You start it at 7:30 and Duplicati immediately creates a backup and then returns to its normal hourly schedule, so the next backup gets created at 8:00. While this is only 30 minutes later, it will be kept, since the newest backup never gets deleted. At 9:00 though, the 8:00 backup gets deleted, since it's not protected anymore by that rule and the time difference between the 7:30 and 8:00 backups is too short for the specified interval. So you end up with a 90-minute difference between the 7:30 and 9:00 backups. The backups at 10:00, 11:00 and so on are then handled as usual again.

@crazy4chrissi Good that we agree on not adding up the time frames; I also think it's more intuitive. About the GUI design: these slider controls look quite neat, but I'm a bit worried about the representation of short time frames (1 day with a 1-hour interval) vs. long ones (3 years with a 1-month interval).
But it will probably take a bit anyway till I (or somebody else) starts bringing this feature into the GUI. First I want this to work correctly in the backend ;)

My current status is: I just couldn't hold myself back any longer and started working on it this evening. It seems to already work correctly in the debugger. Of course I will still incorporate every change that might result from the discussion in this thread, but I finally had to put my idea into code. ^^
The next step for me is trying to figure out how to replace my currently installed version with this dev version, so I can have it running for a few days as a "long term" test.

@zoaked commented Sep 6, 2017

@TekkiWuff - I'm willing to help out with development of the GUI and/or be a beta tester.

@JonMikelV (Contributor) commented Sep 6, 2017

I'm available for beta testing. And if I can figure out how to add a few more incentive bucks I'll toss that in there too. :-)

@gorbiWTF (Author) commented Sep 6, 2017

@TekkiWuff Hm, I didn't think of this problem when the computer isn't turned on. Maybe just ignore the "missed" backup and do the next backup according to the settings, so at 08:00 and not at 07:30. Then it would be at least consistent.

Also, I think this feature:

If a date was missed, the job will run as soon as possible.

should be a checkbox option instead.

@crazy4chrissi commented Sep 6, 2017

Maybe just ignore the "missed" backup and do the next backup according to the settings, so at 08:00 and not at 07:30. Then it would be at least consistent.

This might seem like a good idea for a 1-hour interval, but for a 1-day interval it might end up in a situation where no backup is ever done. Assume the backup is scheduled for 08:00 and for one week you always arrive at the office a little late, around 08:30; you will always miss the scheduled time and thus no backup would run for the whole week.
So maybe make it optional, but definitely keep the default to do backups as soon as possible if one was missed.

@TekkiWuff (Contributor) commented Sep 6, 2017

@gorbiWTF @crazy4chrissi This feature is actually already part of Duplicati. At least my backups always start at uneven times like 19:37 (when I turn on the PC) and then continue nicely at 20:00, 21:00, ... I just wanted to give a heads-up that there might be cases where you end up with backups that have a bigger time difference than the interval you specified.

@zoaked Good to know! Getting into C# is one thing coming from Java, but GUI development with HTML/JS is a whole new thing for me ^^.

Regarding beta testing
I forked the project yesterday and created a branch with the first commit here https://github.com/TekkiWuff/duplicati/tree/staggered-versions
Though there are still some things in there I want to change, most importantly logging of time frames, intervals, and what gets deleted (for debugging), plus logging warnings or aborting if the configuration for this feature can't be parsed or is invalid.
So if you are familiar with compiling and running Duplicati from source, then you can already test it. Otherwise it will probably take a few more days before I can publish a binary to install.

So far I only tried running the compiled Duplicati Command Line Tool by adding the option "--keep-staggered-versions=1D:1h,1W:1D,1Y:1W". This should keep 1 hour backups for the first day, 1 day backups for the first week, 1 week backups for the first year. Every backup outside that range should not be touched at all.

Just as a warning: since this feature is intended to delete certain backup versions, it might be wise to have a backup of your backup or to test it with a completely new configuration. And if you run it on a large existing backup, depending on how often files in your source directory changed, Duplicati might create a lot of traffic the first time, when it tries to compact the remaining backups after deleting many backups.

@JonMikelV (Contributor) commented Sep 6, 2017

@TekkiWuff, are you also implementing the "Remove Deleted Files" option seen in agrajaghh's image?

Also, I'm not sure how Duplicati is set up for localization; should new parameters such as the ones you're developing support alternative language options for each of the time and date period abbreviations?

@kees-z commented Sep 6, 2017

I've some questions about this:
Suppose I schedule backups daily at 8:00. I want to keep all backups from the last week, 2 weekly backups for 3 weeks before the last week and one weekly backup for all older backups. The suggested syntax would be --keep-staggered-versions=1W:1D,4W:2D,100Y:1W.
The first backup is started on a Monday.

Sometimes nothing has been changed since the last backup. In that case, nothing is uploaded and there is no backup for that day, the most recent backup is still valid in that case. This is the way Duplicati behaves (unless you tell it to Always upload backups).

So a list of available backups could look like this ("*" is a skipped backup, because no source files have been changed):
MT*TF*S*TW**SSMT**F*S**WTFSS*T*TFSS

This line represents the backups from the last 5 weeks. In the first week, backups of Wednesday and Saturday are missing, the second week has no Monday, Thursday and Friday backup, etc.
My questions are:

  1. You specify how many backups are kept during a time frame, but how must Duplicati decide which backups should be kept in that time frame? For the last week, for every day a backup should be kept, but no more than 5 backups are available.
    To save 2 backups for the second week in the timeline above, Duplicati could keep the Monday and Thursday backups. That would work for the first week, but the same method would delete all backups of the second week in the timeline. Keeping the first 2 backups of a week is not the best choice for week 3, because both the Monday and Tuesday backups are available; that would delete all backups from Wednesday to Sunday in week 3. How can it be specified which backups are the best ones to delete in a time frame?
  2. Available backups are irregular, because a new backup is not created when nothing changed since the last backup. If backups are deleted from the most recent backups, how should Duplicati decide which backups to delete from the last month? I guess there's not much to automate if there isn't any structure in the available backups. How should Duplicati handle a set of backups in a time frame without fixed intervals?

In summary: can a robust mechanism be designed that keeps the correct backups and deletes only backups that are really redundant? This mechanism should be controllable with a clear, easy-to-understand command-line option, and it should be predictable which backups are kept and which ones will be deleted. This seems quite complicated to me.

@TekkiWuff (Contributor) commented Sep 7, 2017

@JonMikelV I'm honestly not quite sure how much influence I can have on that. The code I'm working on pretty much just uses an already existing function to delete backups, just as if you'd use the command line util to delete a specific version. I didn't intend to rewrite any logic regarding the handling of deleted files.

To my understanding, Duplicati saves a list of all files each time a new backup is created. If the file you deleted is still referenced in any of the backups that exist, then it is still recoverable.
So it's basically like saving a snapshot of the files you want to back up onto a new DVD each day. Then once a week you throw away all but one of the daily DVDs, and once a month you go through the weekly ones and also throw away all but one of them to save space. That means while you still have backups of your old data, your chances of recovering deleted files slowly decrease, especially when these files only existed for a short period of time, since they might only have been on a few of the DVDs you already scrapped.

Regarding localization: I try not to reinvent the wheel and thus am using the same format as the "keep-time" option. As far as I can see, there is no localization for the parameters; it's always s, m, h, D, W, M and Y for second, minute, hour, day, week, month and year. What can and should be localized are the short and long descriptions for the new option that will be shown in the Advanced options.

@kees-z

You specify how many backups are kept during a time frame

In my first post I wrote

I don't really like the idea of having to specify the number of backups to keep, since that doesn't explicitly say anything about how distributed the backups should be within the time frame. Hence I'd go with the interval approach.

So you specify the minimum interval between two backups, not the number of backups to keep. If you have two backups that are 55 minutes apart from each other while you said to keep only one backup per hour, then one of them gets deleted. The algorithm won't make any assumptions about which backup might be "the best" to keep but simply deletes backups that are too close to each other.

The idea for the algorithm is as follows (a rough code sketch follows the list):

  • protect the newest backup from deletion by removing it from the list of potential backups to delete
  • put each backup into the time frame group according to its age
  • for each group do:
    • take the oldest backup in the group as the first backup to keep.
    • advancing towards the newest backup, for each backup do:
      • check if its date and time is far enough away from the previously kept backup; if so, keep it, otherwise delete it.
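
A rough sketch of that logic (illustration only, not the actual code from my branch; the names are hypothetical and edge cases are simplified):

using System;
using System.Collections.Generic;
using System.Linq;

static class RetentionSketch
{
    // rules: (time frame, minimum interval) pairs, shortest frame first.
    // Returns the backups that would be deleted; the newest backup and any
    // backup older than every frame are never touched.
    public static List<DateTime> SelectForDeletion(
        IEnumerable<DateTime> backupTimes,
        List<(TimeSpan Frame, TimeSpan Interval)> rules,
        DateTime now)
    {
        var toDelete = new List<DateTime>();
        var all = backupTimes.OrderByDescending(t => t).ToList();
        if (all.Count == 0)
            return toDelete;

        // Group everything except the protected newest backup into the
        // first (shortest) time frame it falls into.
        var groups = rules.Select(_ => new List<DateTime>()).ToList();
        foreach (var t in all.Skip(1))
        {
            var age = now - t;
            for (int i = 0; i < rules.Count; i++)
            {
                if (age <= rules[i].Frame)
                {
                    groups[i].Add(t);
                    break;
                }
            }
        }

        // Within each group, walk from oldest to newest and keep a backup
        // only if it is at least one interval away from the last kept one.
        for (int i = 0; i < groups.Count; i++)
        {
            var group = groups[i].OrderBy(t => t).ToList();
            if (group.Count == 0)
                continue;
            var lastKept = group[0];   // the oldest backup in the group is kept
            foreach (var t in group.Skip(1))
            {
                if (t - lastKept >= rules[i].Interval)
                    lastKept = t;
                else
                    toDelete.Add(t);
            }
        }
        return toDelete;
    }
}

Note that an interval of 0 keeps everything in its time frame, matching the behaviour described earlier.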

I went through your example and tried to show which backups are kept and which are deleted with each new backup in the following picture. It might look a bit overwhelming at first, but I hope it clears up some questions in the end.

[table: worked example showing which backups are kept or deleted after each new backup in kees-z's timeline]

@kenkendk (Member) commented Sep 8, 2017

@TekkiWuff Awesome work! I have not worked on it because there are so many corner cases, and I think your graphic shows both how you deal with it and also how many cases there are.

If you think the commandline thing works well, we should merge it and maybe just add a text box in the UI until we get a better solution.

@kees-z commented Sep 8, 2017

@TekkiWuff Thanks for your detailed explanation and the great visualization!
I missed the part about specifying a minimum interval instead of a number of backups to keep. Your approach simplifies the procedure and makes it much easier to capture in a straightforward algorithm.
Thanks for clearing that up!

@gorbiWTF (Author) commented Sep 8, 2017

I don't understand the graphic at all, I'm sorry.

For the last two hours I "studied" the Wikipedia article and made this graphic. @TekkiWuff is going to implement the Grandfather-Father-Son principle, isn't he? Then it should be like this:

[diagram: proposed GFS schedule with daily (D1-D4), weekly (W1) and monthly backups, Monday to Friday]

This is basically how I would want to do it: daily backups for one week, weekly backups for a month, monthly backups forever; only Monday to Friday. D1 from Monday the 1st is replaced by D1 on Monday the 8th, but only after the new one has successfully finished. The same goes for D2 to D4. On Friday the 5th of January, my last backup for the week is created (W1), and it gets replaced on Friday the 2nd of February.

I guess the difference is that Wikipedia says that the youngest backup should be kept, not the oldest like I said before.

But I think the problem is that you would have to somehow "mark" every backup as a daily, weekly, etc. backup. And if I miss the backup on 31.01. I wouldn't have a monthly backup. So on 01.02. the routine would have to set the backup from 30.01. as the monthly backup, otherwise it would be deleted on 06.02. The same goes for every other backup. The time starts after the first successful backup, and after one week there must always be a weekly backup (the newest possible), even if I miss 4 out of 5 days, so it doesn't get deleted the next week.

I don't know if this makes any sense though. I will shut up now :D

@mohakshah commented Sep 8, 2017

@gorbiWTF while contemplating this issue, I came across the same problem, i.e. the need to "mark" the backups. One of the ways I thought it could be done was by uploading a separate "versions" file to the backend that would contain this metadata. For example, there could be a duplicati-versions.json.zip file which could contain metadata as follows:

{
  "daily": [
    {
      "index": 0,
      "time": "20170819T112438Z"
    },
    {
      "index": 1,
      "time": "20170818T112438Z"
    },
    {
      "index": 2,
      "time": "20170817T112438Z"
    },
    {
      "index": 3,
      "time": "20170816T112438Z"
    }
  ],
  "weekly": [
    {
      "index": 0,
      "time": "20170813T112438Z"
    },
    {
      "index": 1,
      "time": "20170817T112438Z"
    }
  ],
  "monthly": [
    {
      "index": 0,
      "time": "20170720T112438Z"
    },
    {
      "index": 1,
      "time": "20170616T112438Z"
    },
    {
      "index": 2,
      "time": "20170516T112438Z"
    }
  ],
  "yearly": [
  ]
}

This is how a "4 daily, 2 weekly, 3 monthly and 2 yearly" backup versioning could be handled. This could make versioning quite manageable when using the grandfather-father-son strategy. Although, it would probably require more changes to the codebase. Plus, @TekkiWuff's strategy will easily accommodate with the existing backups.

@TekkiWuff (Contributor) commented Sep 9, 2017

Great to see that there is such a lively discussion about this topic. :)

@gorbiWTF Hm maybe I misunderstood the wikipedia paragraph then. I apologize! :-/

My approach can probably best be described as "spill over", as others in this thread already called it. The "daily bucket" would have daily backups at an interval of 24 hours. Once a backup is over a week (7 days) old, it spills over into the "weekly bucket", but only if no other backup in the last seven days spilled over. The same then happens with the "weekly bucket". Once a backup is older than a month (31 days), it spills over into the "monthly bucket", or in your case the "forever bucket", but only if no other backup has spilled over into that bucket in the last 31 days before.

From your perspective the problem will probably be that this cannot guarantee that the weekly backups are always on a certain day of the week, like for example Friday. Likewise with the monthly backups.
Just out of curiosity: how important is it to have the GFS principle exactly as you described it? What would be the drawbacks of my proposal for your specific use case?

@kenkendk Thanks for the praise. While my suggested solution has now been running locally for a few days without a problem, I'm pondering whether it's ready to release. It feels like there are still some questions open for discussion, and I certainly don't want to quickly force my implementation onto other users who might expect something different, like for example gorbiWTF.

Reading JonMikelV's question earlier this week also made me worried that users might not be aware of the following: for deleted files there is a chance that they can't restore the most recent version from before the file was deleted or, even worse, might not be able to recover the file at all if it only existed briefly in one backup which got deleted.
This is due to the fact that whole backups get deleted instead of applying the rules to each and every file individually. If I'm not mistaken, the latter would potentially require downloading, modifying and uploading many *.dlist files instead of just deleting them, and I'm not sure if that is feasible.
On the other hand, it might very well be that a deleted file never gets removed if it just so happens to be in a monthly backup which is kept for years according to the specified configuration.

Despite this, I updated the code in my repo, in case anyone wants to have a look at it. Though I mostly just added logging and did some refactoring. The code still works the same.

@gorbiWTF (Author) commented Sep 10, 2017

Well, from my perspective a week is always from Monday to Sunday and not the last 7 days. A month is also from the first day to the last day of that month, not the last 30 days. A weekly backup should be from Sunday; if there is no backup from Sunday (the PC wasn't on), it should be from Saturday, and so on.

If I have buckets that spill over, the last daily backup would spill over to become the weekly backup. The only difference I see is the counting of intervals: you count the number of backups (a week is just 7 backups) and I count the number of days (every Sunday a week is done, no matter the number of backups).

If I do daily backups and the bucket is full after six days, the 7th backup spills over and becomes the weekly backup, right? You then empty the bucket (delete the first six backups), do you? Then backup 1 to 6 would be suddenly gone. I would replace backup 1 with backup 8...

As for me, I prefer my approach, because it makes more sense to me and works better for how I want to use Duplicati at work (only 5 days a week), but both approaches seem fine to me when it's just about saving space.

@TekkiWuff (Contributor) commented Sep 17, 2017

Back again at this. Sorry for the delay.

@gorbiWTF

You then empty the bucket (delete the first six backups), do you? Then backup 1 to 6 would be suddenly gone

No, you'd always have 7 days' worth of daily backups (of course, only if you back up regularly). The buckets don't get emptied but rather act as a sliding window. So for example, on any Wednesday you'd still have all daily backups back to the previous Thursday.

Try having a look at my previous example picture, specifically at the line that starts with "S". There you have 7 days worth of daily backups (green area). On the next day (line below starting with the new backup "T") the backup "M" spills over into the bucket which keeps backups only every second day (yellow area). Since it is more than two days away from "M", "K" is kept.
Again a day later the backup "N" spills over into the yellow bucket. But since it is less than two days away from "M", "N" gets deleted.

The result is indeed kind of similar to what you want, but without the guarantee that backups will always be on a certain day (every Sunday, every last day of the month, etc.).
I haven't put much thought yet into how your idea could be implemented though. Maybe @mohakshah's idea would be it.

@JonMikelV Sure, it's possible to change the parameter to require the name of an algorithm first. It's probably more a matter of user experience, especially when this feature is supposed to be shown in the GUI at some point. If you have several algorithms you will have to write lengthy explanations about each and how they work and I guess there probably will be one forum post per week about which one is better and so on ^^. Definitely not something I'm too keen on.

Or at least deleting a block of a file is

Not quite sure what you mean by that. Sure, the whole point is to delete backups and thus blocks, but Duplicati doesn't delete a block that is still referenced by another file or version of a file. So deleting backups merely deletes references to blocks, but that doesn't mean it will delete blocks that are needed by files which are still in other backups. Or in short: you won't have corrupt files in other backups due to this feature.

@zoaked and @JonMikelV (again ^^)
I'm with you that having this feature work per file instead of per backup would be better, but I don't feel very confident implementing such a big change single-handedly.

There are two big areas that come to mind right away which would have to be changed:

Updating the *.dlist files
I, for example, have a folder in which files often get created and deleted, so each time a backup is created there is at least one file that has been deleted. Now if I changed the code to NOT delete backups that contain the latest version of any deleted file, I could basically NEVER delete any backup.
The solution would be to run the algorithm on individual file versions to remove them from backups and then re-create and upload updated *.dlist files while deleting the now outdated ones.

Potential problems I see with this:

  • Transfer size: If you have been gone on holidays for a week or two like me, then the next time you run a backup and thin out your file versions, it has to delete and upload many new *.dlist files. In my case that would be 30MB per *.dlist file times 8 backups a day times 8 days = 2.24GB. Not sure how (un)desirable that is.
  • Slow performance: As many have noticed, the file selection in the restore dialog can be very slow if you have many files and backups. The algorithm probably has to access the data in a similar way for ALL files in ALL backups. Of course I cannot say for sure how fast or slow it will be without an actual implementation.
  • Consistency: The changes have to be done both in the local database as well as in the remote storage. But this might not be such a big deal, since Duplicati has to deal with this during normal backup situations already anyway.

The GUI
The restore process currently works on a backup basis. You select a backup and then can browse in or search through the files in that specific backup. The whole process suggests, that what you see is a snapshot of what existed at the time of the backup.
If you start deleting single file versions from within a backup while keeping versions of other files in that backup, the backups won't be consistent snapshots anymore. The solution might be to let the user choose an arbitrary time and date and show the user the most up-to-date backed-up version of each file and directory (like CrashPlan). This would definitely have to be discussed beforehand.

Looking forward to get more input on all this from you guys :)

@meersjo commented Oct 3, 2017

Not entirely sure how far implementation has come here, but since the old link to the shell-script implementation is gone, I can offer an ancient shell implementation of my own if there is interest in it. It does pretty much what is described here, but I called it "decremental retention" back then.

One difference, I think, is that the sliding window is calculated not from now(), but from the date of the most recent entry - so if your backup doesn't run for a week, nothing will expire, either.

@TekkiWuff (Contributor) commented Oct 3, 2017

@meersjo
I submitted a pull request a few days ago as kenkendk suggested, see PR #2781
It contains the algorithm as I tried to explain in #2084 (comment), since it still makes the most sense to me and for my use case. Though, as the whole discussion here has shown, there is definitely more than one POV on how to do it "right" ^^

But I see no reason why you should hold back your scripts, in case people don't like what I did or the PR never actually gets accepted :)

@meersjo commented Oct 3, 2017

I think (I wrote this back in the early 2000s...) my script works on very similar principles, except it starts counting backups-to-keep from the most recent one in each timeframe - the reasoning being that that one is going to hold the most up-to-date versions of each file, so you'd lose the least work.

Said script is at https://github.com/meersjo/toolkit/blob/master/various/datedirclean.sh .

@TekkiWuff (Contributor) commented Oct 16, 2017

Just a quick update: The feature is now available via the retention-policy option as of version 2.0.2.10_canary_2017-10-11
I went with my initial implementation as described in #2084 (comment). I'm sorry if it doesn't fit everybody's needs. Maybe it can be extended by others in the future.

If anybody is more experienced with extending the Duplicati GUI than me, then please don't hesitate to bring this option into the GUI. There were already some ideas further up, which might be a good start.
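
For reference, a usage sketch from the command line (the exact executable name, target URL and source path depend on your installation; the value format is the one discussed above, e.g. hourly backups for a day, daily for a week, weekly for a year):

Duplicati.CommandLine.exe backup <target-url> <source-folder> --retention-policy="1D:1h,1W:1D,1Y:1W"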

@gorbiWTF (Author) commented Oct 16, 2017

Is this correct?

s = second
m = minute
h = hour
D = day
W = week
M = month
Y = year

Is there any limitation on how many "options" I can put there?

@kenkendk (Member) commented Oct 16, 2017

Yes, those are the correct unit suffixes. There is no limit to how many retention rules you can add.

@canove commented Oct 21, 2017

Before Duplicati, I used to run CloudBacko for my backups. They have a nice advanced retention policy with the same intention.

The GUI looks like this:
[screenshots: CloudBacko advanced retention policy settings]

With this configuration, I take daily backups for 7 days, weekly backups for 4 weeks, and monthly backups for 6 months.

Duplicati is awesome, always getting better and better!
Good job!

kenkendk added a commit that referenced this issue Jan 22, 2018

Added basic retention to the UI.
This covers #2084 but with a simpler UI than what is proposed.
@kenkendk (Member) commented Jan 23, 2018

I am closing this, as we now have a (simple) UI for the feature.
@TekkiWuff feel free to grab the bounty:
https://www.bountysource.com/issues/38912971-retention-policy-like-time-machine

Others: if you want a fancier UI for choosing the retention steps, feel free to create a new issue.
