Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local database should store file paths in a space-optimized fashion #1283

Closed
bks opened this issue Feb 9, 2015 · 2 comments

Comments

Projects
None yet
4 participants
@bks
Copy link

commented Feb 9, 2015

Using Duplicati 2.0.80, I have a ~75 GB backup set, but the local database is taking up 1.999 GB itself.

From poking at %APPDATA%\Duplicati\LRELCBKGXP.sqlite with sqlite3, it appears that a great deal of that space is used storing the full path for every file in the database. Running SELECT sum(length(Path)) FROM File; reports that the actual path strings occupy ~572 MB not counting overhead. Running DROP INDEX FilePath; and comparing the .sqlite file size before and after indicates that the index uses ~730 M, so the total space cost of the path strings is probably closer to 600 MB in the main table and then the same amount again in the index.

I did a bit more analysis of the paths using python:

tot, paths = 0, 0
for name in db.execute('SELECT Path FROM File'):
     tot += name[0].rindex('\\')
     if name[0].endswith('\\'): # this File is for a directory
         paths += len(name[0])
print 'Total redundant path components: %d, total unique path components %d' % (tot, paths)

It tells me that the leading path components contribute ~418 MB of data for only ~36 MB of unique leading paths; the 418 MB is then doubled by the FilePath index.

So, I suggest optimizing the storage of file paths. One simple way is to partially dematerialize the full path into dirname and basename parts:

CREATE TABLE "Dirname" (
      "ID" INTEGER PRIMARY KEY,

 -- this creates an implicit index, but the UNIQUE constraint is important
      "Dirname" TEXT UNIQUE NOT NULL
);

CREATE TABLE "File" (
 (
        "ID" INTEGER PRIMARY KEY,
        "Dirname" INTEGER NOT NULL,
        "Filename" TEXT NOT NULL,
        "BlocksetID" INTEGER NOT NULL,
        "MetadataID" INTEGER NOT NULL
);

-- If the UNIQUE constraint is really only (Dirname, Filename), it would be nice to take the other
-- columns out of the index. But if not, this is still a lot smaller than before.
CREATE UNIQUE INDEX "FilePath" ON "File" ("Dirname", "Filename", "BlocksetID", "MetadataID");

This fairly simple change will probably shrink the local database by 50+%.

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/8410154-local-database-should-store-file-paths-in-a-space-optimized-fashion?utm_campaign=plugin&utm_content=tracker%2F4870652&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F4870652&utm_medium=issues&utm_source=github).

@kenkendk kenkendk added the enhancement label Feb 9, 2015

@kenkendk

This comment has been minimized.

Copy link
Member

commented Feb 9, 2015

Great investigation!

I had considered doing a full split of the path, but that gives very large chained lookups.
Your solution would work with minimal overhead, given that most folders contains more than a single file.

@rocifier

This comment has been minimized.

Copy link

commented Feb 9, 2015

Love it. In terms of an index... It would actually probably be more efficient both in storage space and program speed to index a single column which contains a hash. The hash could be custom built in the app to represent a combination of whichever columns need to be included to identify the record as unique. For example, a hash might be computed for Dirname + Filename and stored as an integer. When this record is later looked up, a temporary hash is generated eg. md5(Dirname + Filename) as integer; and used to retrieve the record from the database.

@renestach renestach added this to the preview_2.0.0.90 milestone Apr 9, 2015

kenkendk added a commit that referenced this issue Jun 3, 2015

@kenkendk kenkendk closed this in a8a32ea May 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.