Local database should store file paths in a space-optimized fashion #1283
Using Duplicati 2.0.80, I have a ~75 GB backup set, but the local database is taking up 1.999 GB itself.
From poking at the database, I did a bit more analysis of the paths using Python:
```python
import sqlite3

# Path to Duplicati's local database file (adjust as needed)
db = sqlite3.connect('Duplicati.sqlite')

tot, paths = 0, 0
for (name,) in db.execute('SELECT Path FROM File'):
    tot += name.rindex('\\')  # bytes taken by the leading directory component
    if name.endswith('\\'):   # this File row is for a directory
        paths += len(name)
print('Total redundant path components: %d, total unique path components %d' % (tot, paths))
```
It tells me that the leading path components contribute ~418 MB of data for only ~36 MB of unique leading paths; the 418 MB is then doubled by the FilePath index.
So, I suggest optimizing the storage of file paths. One simple way is to partially dematerialize the full path into a separate table of directory names:
CREATE TABLE "Dirname" ( "ID" INTEGER PRIMARY KEY, -- this creates an implicit index, but the UNIQUE constraint is important "Dirname" TEXT UNIQUE NOT NULL ); CREATE TABLE "File" ( ( "ID" INTEGER PRIMARY KEY, "Dirname" INTEGER NOT NULL, "Filename" TEXT NOT NULL, "BlocksetID" INTEGER NOT NULL, "MetadataID" INTEGER NOT NULL ); -- If the UNIQUE constraint is really only (Dirname, Filename), it would be nice to take the other -- columns out of the index. But if not, this is still a lot smaller than before. CREATE UNIQUE INDEX "FilePath" ON "File" ("Dirname", "Filename", "BlocksetID", "MetadataID");
This fairly simple change would probably shrink the local database by 50% or more.
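For illustration, here is a rough sketch of how inserts and lookups could work against that schema (the sqlite3 usage and the `split_path` helper are my own assumptions, not anything Duplicati actually does):

```python
import sqlite3

def split_path(path):
    # Split 'C:\\foo\\bar.txt' into ('C:\\foo\\', 'bar.txt'); a path ending
    # in '\\' is a directory entry, so its filename part is empty.
    i = path.rindex('\\')
    return path[:i + 1], path[i + 1:]

def insert_file(db, path, blockset_id, metadata_id):
    dirname, filename = split_path(path)
    # INSERT OR IGNORE leans on the UNIQUE constraint to deduplicate dirnames
    db.execute('INSERT OR IGNORE INTO Dirname (Dirname) VALUES (?)', (dirname,))
    (dir_id,) = db.execute('SELECT ID FROM Dirname WHERE Dirname = ?',
                           (dirname,)).fetchone()
    db.execute('INSERT INTO File (Dirname, Filename, BlocksetID, MetadataID) '
               'VALUES (?, ?, ?, ?)', (dir_id, filename, blockset_id, metadata_id))

def lookup_path(db, path):
    # The full path is reconstructed through a single join on the integer key
    dirname, filename = split_path(path)
    return db.execute('SELECT f.ID, f.BlocksetID, f.MetadataID '
                      'FROM File f JOIN Dirname d ON f.Dirname = d.ID '
                      'WHERE d.Dirname = ? AND f.Filename = ?',
                      (dirname, filename)).fetchone()
```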
Love it. In terms of an index, it would probably be more efficient, in both storage space and speed, to index a single column containing a hash. The hash could be built in the app from whichever columns are needed to identify the record as unique. For example, a hash might be computed for Dirname + Filename and stored as an integer. When the record is later looked up, a temporary hash is generated, e.g. md5(Dirname + Filename) as an integer, and used to retrieve the record from the database.
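A minimal sketch of that idea, assuming a hypothetical integer `PathHash` column on `File` with its own index (truncating md5 to 64 bits is one arbitrary choice); since distinct paths can collide on the hash, the lookup still verifies the real columns:

```python
import hashlib
import struct

# Hypothetical schema addition, replacing the wide multi-column index:
#   ALTER TABLE File ADD COLUMN PathHash INTEGER;
#   CREATE INDEX FileHash ON File (PathHash);

def path_hash(dirname, filename):
    # First 8 bytes of md5(Dirname + Filename), packed as a signed 64-bit
    # integer so it fits a single SQLite INTEGER column.
    digest = hashlib.md5((dirname + filename).encode('utf-8')).digest()
    return struct.unpack('<q', digest[:8])[0]

def lookup_by_hash(db, dirname, filename):
    # Probe the narrow hash index, then re-check the actual columns because
    # two different paths can hash to the same integer.
    return db.execute('SELECT f.ID, f.BlocksetID, f.MetadataID '
                      'FROM File f JOIN Dirname d ON f.Dirname = d.ID '
                      'WHERE f.PathHash = ? AND d.Dirname = ? AND f.Filename = ?',
                      (path_hash(dirname, filename), dirname, filename)).fetchone()
```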