
Store destination files in sub folders. #1374

Open
ironjohn opened this issue May 7, 2015 · 28 comments

Comments

@ironjohn

commented May 7, 2015

In my experience with other backup software, we have seen slowdowns with destination folders containing thousands of individual files. One preventative step is to store backed-up data in subfolders, creating a new one every XXXX files.

When it comes time to get a directory listing from, say, an FTP server, you would be looking at 50 folders with 10k files in each, rather than 500k files.

This could be adjustable with a command-line option, but really, just setting the number of files per subfolder to 1000 in code would get the job done. 1k is a very conservative number and I'm sure anything south of 10k would be fine.

Caution for caution's sake pulls me to the lower numbers, but nothing is free, and I'm sure we would see a trade-off in DB size or performance if the number of folders grew too high as well, which reminds me of an issue you guys recently fixed with folder paths in SQLite.


@kenkendk

Member

commented May 11, 2015

I have discussed it a bit with @renestach in the past.

Many protocols have issues with very large file listings, and most filesystems suffer when the folders have many files.

If we were to use sub-folders, it would solve that issue.

However, there is a tradeoff, in that the code would need to traverse all the folders to find all the files.
Currently there is no notion of folders in Duplicati's backends, so that needs to change.

To support both sub-folder and non-sub-folder setups, there needs to be a marker file inside the root folder.

If the folder name is recorded during upload, it might be possible to use a simple naming scheme, like "folder-1". It would be simple to reconstruct the path from the local database, and the path can be recorded when a remote recursive scan is performed.

This can easily be extended to support sub-sub-folders etc.

Another approach is to use the unique filename, and then construct a hierarchy from that, i.e. the filename is "prefix-tXXYYZZNNNNNNN", corresponding to the path "XX/YY/ZZ/prefix-tXXYYZZNNNNNNN". That would leave a maximum of 256 folders on each level with a random distribution. This scheme works poorly with a low number of files, and there is currently no real support for rename/move on the remote backend, making it really expensive to change the layout afterwards.
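
A rough sketch of that second scheme (illustrative only; the "prefix-tXXYYZZNNNNNNN" format and the three levels follow the example above, not actual Duplicati naming):

```python
def hierarchical_path(filename: str, levels: int = 3) -> str:
    """Derive the remote folder path from the file name itself, as in the
    "XX/YY/ZZ/prefix-tXXYYZZNNNNNNN" example: the first 2*levels characters
    of the random part become nested folder names, so the path can be
    computed from the name alone without a remote lookup."""
    _, _, rest = filename.partition("-")   # "prefix-tXXYY..." -> "tXXYY..."
    rand = rest[1:]                        # drop the type letter ("t")
    parts = [rand[i * 2:(i + 1) * 2] for i in range(levels)]
    return "/".join(parts + [filename])

# hierarchical_path("prefix-t1a2b3c0042") == "1a/2b/3c/prefix-t1a2b3c0042"
```

With hex characters, each 2-character level gives the 256 folders mentioned above.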

@ironjohn

Author

commented May 11, 2015

Warning: I have heavy brain fog today, so my math and logic may be off. My spelling sucks on a good day to boot. If it doesn't make sense feel free to correct me. Thanks in advance. :-)

I don't think you would need multiple levels of nested folders. In fact, I can see that causing heartburn in the future if the rules governing sub-subfolder creation get cross-wired and Duplicati goes hog wild with infinite folder nesting.

Consider naming the folders alphanumerically, in ascending order, starting with a folder named 1 and ending with ZZ. I would reserve 0 (and 00); I can see some future glitch in an OS or framework update causing problems with a folder name of all 0's being indexed. I'm probably overthinking that, but it is a data backup conversation.

After dropping the 0 and 00-0Z folders, we would have 1,259 possible folders with just two characters at the XX position from your example. No need for subfolders if you consider how many files you can store. Assuming my 1000-files-per-folder suggestion (which is a good, safe minimum in my opinion, and very conservative), we would have over a million files in a single backup set without nesting.

We could add a third character (again dropping the leading 0) and reach 44,975 folders, but that would put us back where we started with a really large directory to list the contents of. The third character could be restricted to a single value, such as 1 (100-1ZZ), giving us 12,960 folders.

With 1000 files per folder we get 13 million (that's THIRTEEN FRIK'N MILLION) 50 MB+ file chunks, for a total of 648 TB (HFS Batman!!!) of backed-up family photos and daily PST file deltas.

@ironjohn

Author

commented May 12, 2015

Yeah, so I must have been running a fever or something. This is some convoluted folder naming convention. Rube Goldberg is smiling down on me.

It's way easier to actually number the folders 1 - 9999, or any justified upper limit (or none?).

@Pectojin Pectojin closed this Feb 11, 2018

@kees-z


commented Mar 9, 2019

I have some ideas for the logic of using remote subfolders that eliminate the drawbacks that were mentioned in this thread, discussed here.

@archon810


commented Aug 4, 2019

Why was this closed?

@Pectojin

Member

commented Aug 4, 2019

Inactivity, heavy refactoring/compatibility costs, and no clear path forward.

If anyone is interested in working on it we can reopen it. It just didn't seem like it was going to happen a year ago.


@BlueBlock

Contributor

commented Aug 4, 2019

I'll be looking at it since it's something needed for a few of the backends.

@archon810


commented Aug 4, 2019

Should the ticket be reopened then or is there a new one?

@BlueBlock

Contributor

commented Aug 4, 2019

> Should the ticket be reopened then or is there a new one?

Might as well leave this open and keep all related content together.

@archon810


commented Aug 4, 2019

I don't have permission to reopen it; can someone who does, reopen it?

@Pectojin Pectojin reopened this Aug 5, 2019

@BlueBlock

Contributor

commented Aug 6, 2019

The idea would be that pre-existing single-directory backups continue to be single-directory, as it may be costly to rearrange large backups on remote destinations.

I'm of the opinion that all new backups should always use subfolders. Is there any reason not to always use subfolders?

@archon810


commented Aug 6, 2019

I don't see one. Having also just tried duplicacy, I found that it already stores its chunks in subfolders, so there's precedent.

@BlueBlock

Contributor

commented Aug 7, 2019

I have the local file backend and Mega.nz now using sub-folders, while also handling pre-existing backups without subfolders, which will continue not to use them.

It will take a bit more work to get every backend changed to utilize it.

I use the value of the filenames, since they are random, to place block and index files into sub-folders.

I've made it 2 levels deep. The first folder name is the starting "b" or "i" plus the next 3 characters of the filename; the second folder name is the following 3 characters.

The "dlist" files remain at the root.

For example:
FILE: duplicati-b1775fb4398574c3eaba98eac2f763953.dblock.zip.aes
BACKEND: b177/5fb/duplicati-b1775fb4398574c3eaba98eac2f763953.dblock.zip.aes

(screenshot: backend folder layout)

LOCAL FILE SYSTEM:

(screenshot: local folder layout)
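
As an illustrative Python sketch, the mapping described above works out roughly like this (not the actual patch; the ".dlist." check simply encodes "dlist files remain at the root"):

```python
def backend_path(filename: str) -> str:
    """Two-level layout: first folder = type letter ("b"/"i") plus the next
    3 characters of the random name, second folder = the following 3
    characters; dlist files stay in the root."""
    if ".dlist." in filename:
        return filename                    # dlist files remain at the root
    rand = filename.split("-", 1)[1]       # "b1775fb43985..." (type letter + hex)
    return f"{rand[:4]}/{rand[4:7]}/{filename}"

# backend_path("duplicati-b1775fb4398574c3eaba98eac2f763953.dblock.zip.aes")
#   == "b177/5fb/duplicati-b1775fb4398574c3eaba98eac2f763953.dblock.zip.aes"
```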

@archon810


commented Aug 7, 2019

My concern is the additional API requests to traverse such a two-level single-file hierarchy. API requests to online services tend to be slow and add a lot of overhead.

In comparison, duplicacy uses a single-level dir depth with 2 chars (hex) in each dir name and then multiple files in those dirs. I think it's a sensible approach that would still resolve the issue here as you end up dividing the storage into 256 dirs.

@BlueBlock

Contributor

commented Aug 7, 2019

Some backends might need more API calls, but the Mega API, for example, provides a single call that returns a complete list of all your files, so in this case there is no increase in API calls.

In addition, the reason I went with this folder scheme is that the folders can be calculated from the filename, possibly reducing any extra API calls depending on the backend provider's API features.

@archon810


commented Aug 7, 2019

Depends on the implementation, I guess. If optimizations are carefully considered and implemented, and the overall performance isn't severely impacted, 👍.

Personally, I'm using Box and Google Drive, so that's what I'd be concerned with in my tests.

@kees-z


commented Aug 7, 2019

Nice!
Looks like we have a proof of concept that Duplicati is able to use remote subfolders.

However, I see some drawbacks in your approach:

  • The most important one is the number of subfolders. In your approach, all backups, including larger ones, will result in an extensive list of (sub)folders containing one or a few files. The maximum number of subfolders is 4,897,620 ( 2 * 256 for the parent folders, 4096 for the subfolders). So after about 2.5 million DBLOCK/DINDEX pairs, you can be sure that uploaded files will be stored in an existing (sub)folder. The high number of subfolders will dramatically influence performance of all operations that require a remote file list (check backend files, recreate local db etc).
  • In your approach, the first character in the filename (i or b) could be skipped. This will allow creation of more parent folders (4096 instead of 512) and increase the total number of folders to 273 million.

Instead, I prefer folders that fill up to a (pre)defined maximum.
The best approach I can think of is calculating a checksum for every filename. Before the first upload, a random value is generated (for example a2d5). This will be the name of the subfolder.
We have to cheat a bit to upload the first batch of files to this subfolder: a random 24-character value for the filename could be generated, and the last 4 characters could be calculated to match the current checksum.
If the maximum number of files is reached, a new checksum value is generated to fill up the next subfolder.
More background info about this approach can be found here.
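
A toy sketch of the idea (the checksum below is just an XOR over 4-hex-character groups, chosen because it is invertible so the last group can be solved for; the real checksum function is open for discussion):

```python
import secrets

GROUP = 4  # folder names and checksums are 4 hex characters

def checksum(name: str) -> str:
    """XOR the 4-hex-character groups of the file name together."""
    acc = 0
    for i in range(0, len(name), GROUP):
        acc ^= int(name[i:i + GROUP], 16)
    return f"{acc:04x}"

def new_filename(target_folder: str, length: int = 24) -> str:
    """Generate a random hex file name whose checksum equals the folder
    currently being filled, by solving for the last 4 characters."""
    body = secrets.token_hex((length - GROUP) // 2)   # 20 random hex chars
    acc = 0
    for i in range(0, len(body), GROUP):
        acc ^= int(body[i:i + GROUP], 16)
    return body + f"{acc ^ int(target_folder, 16):04x}"

folder = secrets.token_hex(2)      # e.g. "a2d5": the subfolder being filled
name = new_filename(folder)
assert checksum(name) == folder    # the folder is derivable from the name alone
```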

A nice extension to this could be a few new advanced options for back and forth migration between subfolders and no subfolders:
--use-subfolders=auto|on|off
Default: auto
This will enable or disable subfolders for new backups. For existing backups, value auto will detect if subfolders are used and keep this setting for additional backups.
If this is set to on or off, migration will be performed during the next backup (see below for how migration could work).

--migrate-filecount=n
Default: 10
This sets the number of files that will be migrated during a backup operation. For larger backups, the migration can be done in batches, which avoids downloading and re-uploading all remote files at once.

Migration could work quite simply:
A DBLOCK or DINDEX file is downloaded and renamed to something that matches a generated checksum, and the renamed file is uploaded to the matching subfolder.
Migration from subfolders to a single folder is even simpler: download a file from a subfolder and re-upload it to the root folder at the backend.
There are some challenges to resolve, like keeping track of the migration status and finding out if a backup should use subfolders or not during a migration, but these problems should be resolvable.
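
As a rough sketch of such a migration batch (the backend and db objects and their methods are hypothetical stand-ins, not real Duplicati interfaces; new_filename is the helper from the sketch above):

```python
def migrate_batch(backend, db, migrate_filecount: int = 10) -> None:
    """Move up to --migrate-filecount files from the backend root into
    checksum-matching subfolders, recording each rename locally."""
    for old_name in backend.list_root_files()[:migrate_filecount]:
        data = backend.get(old_name)               # download the remote volume
        folder = db.current_target_folder()        # subfolder currently being filled
        new_name = new_filename(folder)            # name whose checksum matches
        backend.put(f"{folder}/{new_name}", data)  # re-upload into the subfolder
        backend.delete(old_name)                   # remove the old root copy
        db.record_rename(old_name, f"{folder}/{new_name}")  # track migration status
```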

@BlueBlock

Contributor

commented Aug 7, 2019

> The most important one is the number of subfolders. In your approach, all backups, including larger ones, will result in an extensive list of (sub)folders containing one or a few files. The maximum number of subfolders is 4,897,620 (2 * 256 for the parent folders, 4096 for the subfolders). So after about 2.5 million DBLOCK/DINDEX pairs, you can be sure that uploaded files will be stored in an existing (sub)folder. The high number of subfolders will dramatically influence performance of all operations that require a remote file list (check backend files, recreate local db etc).

Not sure I follow. The first-level folders have "b" or "i" plus 3 chars, so that is 16 × 16 × 16 = 4,096 possible folders per "b" or "i" file type. Then the 2nd-level folder has 3 chars, so that's another 4,096 possible folders. Total folders would be 4,096 × 4,096 = 16,777,216 at 2 levels.

Also, 3 characters were chosen because we want to stay under the 32k per-folder file limit of systems like FAT32.

The number of subfolder levels is configurable, though. A single level providing 4096 folders (8192 across "b" and "i") might be perfectly fine for many and might be a good default.

Speed impact will vary. Since Mega pulls all items in one call, we have direct access to a file and no folder traversal is needed. The same would be true for S3 and likely others.

Further testing will flush out the impact and give an idea of the best default setting (0, 1, 2, etc.).

I've added a setting --backend-folder-depth

> In your approach, the first character in the filename (i or b) could be skipped. This will allow creation of more parent folders (4096 instead of 512) and increase the total number of folders to 273 million.

The first character ("b" or "i") is nice in that it visually groups the type of file at the top folders. I take the first 4 characters instead of 3 because including the b/i adds some nice context/grouping to the top 1st-level folders.

@kees-z


commented Aug 8, 2019

> Not sure I follow. The first-level folders have "b" or "i" plus 3 chars, so that is 16 × 16 × 16 = 4,096 possible folders per "b" or "i" file type. Then the 2nd-level folder has 3 chars, so that's another 4,096 possible folders. Total folders would be 4,096 × 4,096 = 16,777,216 at 2 levels.

Sorry for the confusion, I did misread the number of characters in the parent folder names. You're correct, the maximum number of folders is 16 million.

However, this will spread the backup files over even more folders. In practice, there will be a huge folder tree with one file in every folder. With 16 million possible locations, the chance that any 2 files end up in the same folder is negligible.
This will also impact performance: for every uploaded file, a new folder must be created (and deleted if the file is deleted).

We really need a mechanism that fills folders up to a predefined limit before the next folder is created. The checksum approach is an example of this, but maybe there are better approaches.

@BlueBlock

Contributor

commented Aug 8, 2019

The good news is that I think my solution provides a lot of flexibility: the method of placing the files can be adjusted easily and does not impact the backend code itself once a backend has been modified to handle subfolders.

For example, even adjusting dynamically would work: 1 char at the 1st level, 2 chars at the 2nd level, and 3 chars for any additional levels.

The overall solution will work even with a mix of file placements.

@BlueBlock

Contributor

commented Aug 9, 2019

There are backends that support listing files only at each directory level, so walking that structure would be inefficient with sparsely populated directories. I agree we should fill directories.

So these seem to be the requirements:

  • limit the total number of files+directory entries within a directory
  • fill a directory to the limited amount before creating a new directory
  • keep the directory structure flat

Let's say we use 0-9, a-z, which gives us 36 × 36 = 1,296 directories, and we'll have 2000 files per directory. For the 1st-level directories that gives us 1,296 × 2000 = 2,592,000 files on the 1st level.

We could also fill the root directory with 2000 files first and then move on to directories.

@BlueBlock

Contributor

commented Aug 11, 2019

I now have the logic in place to fill folders in a flat manner.

For example, with 2000 files per folder and 1000 folders per folder:
The first 2000 files go in the root “/” directory. The next 2000 files go in “/1/”.
The 25,654,396th file goes in “/12/826/”.
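
A sketch of one numbering that reproduces the example above (the actual implementation may count or name folders differently):

```python
def folder_for(file_number: int, files_per_folder: int = 2000,
               folders_per_folder: int = 1000) -> str:
    """Return the folder for the n-th uploaded file, filling the root first,
    then folders /1/../1000/, then two-level folders."""
    i = file_number - 1                                  # 0-based index
    if i < files_per_folder:
        return "/"                                       # first 2000 files: root
    i -= files_per_folder
    if i < folders_per_folder * files_per_folder:
        return f"/{i // files_per_folder + 1}/"          # folders /1/ .. /1000/
    i -= folders_per_folder * files_per_folder
    slot = i // files_per_folder                         # 2nd-level folder index
    return f"/{slot // folders_per_folder + 1}/{slot % folders_per_folder}/"

assert folder_for(1) == "/"
assert folder_for(2001) == "/1/"
assert folder_for(25_654_396) == "/12/826/"
```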

I just ran a test with 10 files per folder and 10 folders per folder on 6GB of data at 50MB blocks:

(screenshot: test result)

@BlueBlock

Contributor

commented Aug 12, 2019

I need to touch each type of destination to enable it to handle the sub-folders. Is there a particular order of preference for the backends for this work?

Done:

  • local file
  • Google Drive
  • Mega

In progress:

  • ftp-alternate
  • ftp

To be done (all remaining, but in what order?): [updated]

  1. OneDrive
  2. Box.com
  3. SharePoint
  4. ftp-alternate
  5. ftp
    ...
@kees-z


commented Aug 12, 2019

I would recommend prioritizing backends that have a limit on the number of files per folder, like MS OneDrive, SharePoint, and Box.com.

@BlueBlock

Contributor

commented Aug 12, 2019

> I would recommend prioritizing backends that have a limit on the number of files per folder, like MS OneDrive, SharePoint, and Box.com.

Good point. Thanks.

@BlueBlock

Contributor

commented Aug 12, 2019

For anyone: when a backend folder does not exist, why ask the user if they want it created during "test"?

Why not just create the folder if it does not exist? If a backup is run when a folder does not exist, the backup will silently create the folder anyway.

@kees-z


commented Aug 12, 2019

During Test in the Add backup wizard, the question is useful.
The specified folder might not exist because of a typo in the path. Not asking the question would result in creating unwanted empty folders at the backend.
