Use 1MB default block as suggested by users. #4925

Merged
merged 1 commit into duplicati:master from blockdefault on Apr 21, 2024

Conversation

mr-russ
Contributor

@mr-russ mr-russ commented Apr 22, 2023

This still allows 50 blocks per volume at the default upload size, and it reduces the number of blocks to track by a factor of 10.
If you are running larger backups, you want a larger block size anyway.

If you have small backups and the change is a problem, recreating them will take less time.
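
As a quick check of the arithmetic in the description (an aside; the 50MB figure is the assumed default remote volume size, which this PR does not change):

// Not Duplicati code; just the numbers behind the description above.
const long volumeSize   = 50L * 1024 * 1024;   // assumed default --dblock-size (50MB)
const long newBlocksize =  1L * 1024 * 1024;   // the default proposed by this PR
const long oldBlocksize = 100L * 1024;         // the previous 100kB default
// volumeSize / newBlocksize == 50   -> still 50 blocks per uploaded volume
// newBlocksize / oldBlocksize ~= 10 -> roughly 10x fewer blocks to track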

@mr-russ
Contributor Author

mr-russ commented Apr 22, 2023

Fixes #4629

@ts678
Collaborator

ts678 commented Apr 24, 2023

I've been pushing for this unless someone thinks that all the SQL can get a whole lot more tolerant of too-many-blocks soon. Alternatively, maybe someone can think of a way to quiz the user on size. Because I'm doubtful on both, I like this, provided it doesn't run the awful risk of throwing a hand up in the user's face the way manual blocksize changes tend to do. Because I don't know exactly how all the remembering and checking gets done, I'd encourage a lot of scenario testing. What's been tested thus far?

Technically there's probably a change to the manual needed. It's a separate project but does take PRs if people can remember.

@mr-russ
Contributor Author

mr-russ commented Apr 25, 2023

Everything here seems hard. If a default change needs thorough regression testing because all existing backups will break, I've misunderstood the scale of the design problems Duplicati has.

The SQL is complex, and the current 'don't use any RAM' design makes it difficult to understand the limits to work to. It needs a large redesign. Just making hashes binary would reduce the on-disk and in-RAM footprints of the hashes, but there are other problems too.

The way to move from the current stalled state is to be agile: accept and merge small improvements so development can move forward. Guessing the block size for the user is a larger project which requires some thought; ideally, the first backup would guess based on the size of the initial source (see the sketch below). I am not sure I believe the deduplication benefits exist much within a file, as files that change in that way are usually small. Happy to be corrected there. So a larger block size does not hinder the benefits over time.
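
For what it's worth, a purely illustrative sketch of such a guess (a strawman, not something this PR implements): pick a blocksize from an estimate of the initial source size so the total block count stays under some budget.

// Strawman heuristic only, not part of this PR or of Duplicati.
static long SuggestBlocksize(long estimatedSourceBytes)
{
    const long targetBlockCount = 1_000_000;   // assumed budget, purely illustrative
    long blocksize = 100 * 1024;               // never go below the old 100kB default
    while (estimatedSourceBytes / blocksize > targetBlockCount)
        blocksize *= 2;                        // step up until the budget is met
    return blocksize;
}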

@ts678
Collaborator

ts678 commented Apr 26, 2023

There are very few things that can't change, but blocksize is one of them, because blocksets are expressed in terms of blocks of
fixed size, using a zero-based index, which converts easily to a byte offset in the file. If I change the blocksize in an existing config and then back up:

You have attempted to change the parameter "blocksize" from "102400" to "1048576", which is not supported. Please configure a new clean backup if you want to change the parameter.

The above message suggests that you can change at will before a backup exists, but it would still be best to verify that everything works. Before commenting, I did test an export without a specified blocksize, and it didn't export the blocksize at all, meaning an import into Duplicati after this change, followed by a fresh backup, would probably be OK, but at the newer default, as desired.

Even at the default, a backup seems to save the blocksize in the job's database Configuration table, and also in the manifest in the destination files, which is probably necessary because one needs to be able to recreate the job database from the destination files in case of its loss.
Because of the above, I don't think there's much risk of breaking all existing backups, but I'd like any surprise to be found early.

We did a preview before the Canary the last time, so if that trend continues then some people can test things before a Canary.

I support the sacrifice of better deduplication in favor of non-abysmal performance, but I don't think there are stats collected...
Basically I think this PR is the way to go, and I'm just asking about testing. I'm not the maintainer, just commenting while waiting.

After the initial backup is too late to change the blocksize, as all the blocks are made and their hashes stored, so it can't be changed. Scanning the entire source area for size is possible, but likely too slow; ordinarily file scanning and block reading run in parallel.

@mr-russ
Contributor Author

mr-russ commented May 17, 2023

So I spent more time digging into this. Even though the code says you can't change the block size, it really means you can't make the block size smaller. Blocksize is used to create byte buffers in the splitting code. So smaller will cause overflows.
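
To make the direction of the problem concrete, here is a minimal sketch of fixed-size splitting (an assumed shape, not the actual StreamBlockSplitter): the read buffer is allocated from the configured blocksize, which is why a configured value smaller than the one the existing blocks were written with is the troublesome direction.

using System.Collections.Generic;
using System.IO;

// Sketch only: split a stream into fixed-size blocks.
static IEnumerable<(byte[] Data, int Length)> SplitIntoBlocks(Stream source, int blocksize)
{
    var buffer = new byte[blocksize];                  // buffer sized from the configured blocksize
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        yield return ((byte[])buffer.Clone(), read);   // only the final block may be short
}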

The assertion that it's a fixed size in the backup files is incorrect: if you back up files whose sizes aren't exact multiples of the block size, the backup still succeeds without storing how big each block is.

So blocksize changes are possible for a job if you make it bigger. It will mean you lose most of the deduplication benefit during the changeover period.

The blocksize setting is stored as part of the job configuration in all cases, so the job config needs to be lost before an issue could appear. I've not looked specifically at restore; however, it becomes difficult to complete a restore on a different computer if you have to remember the block size of your original backup.

Other than backup/restore of an existing job with the previous default, plus a successful backup here, both of which work, what other testing is required to pass the barrier to be considered good enough to merge?

@ts678
Collaborator

ts678 commented May 17, 2023

So I spent more time digging into this.

Digging is good. Make sure you look at all the code that uses blocksize, including all the operations, and all the programs.

Blocksize is used to create byte buffers in the splitting code. So smaller will cause overflows.

A generalized conclusion based on one usage isn't convincingly extensible everywhere, and I have serious doubts that it works.

Case in point:

CheckingErrorsForIssue1400 and FoundIssue1400Error test case, analysis, and proposal #3868 concluded:

The heart of the flaw is that block sizes are inferred from an assumption that only the last block is short

The assertion that it's a fixed size in the backup files is incorrect: if you back up files whose sizes aren't exact multiples of the block size, the backup still succeeds without storing how big each block is.

Obviously it has to deal with source files that are not an exact multiple of blocksize. It stores the short end block -- and its size.
Look in the BlocksetEntry table and its references to the Block table; every block gets a size there, and many of them are at the full blocksize.

The current BlocksetEntry table scheme, which I just published in the forum, shows how very simple math can map between an Index there and an offset in a source file. This is true for backup, but also for restore, which additionally tries to use
locally available blocks instead of downloading them; that probably relies on the same easy mapping. Restore still has to work.
Also look at how restore downloads dblock files and drops blocks into restored files as needed. Fixed size helps that, I think.
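
As a side note, the "very simple math" is essentially the following (a sketch of the mapping being described, not code from the project):

// With a fixed blocksize, a zero-based block index maps directly to a byte offset
// in the source file, and back again; restore can use the same mapping to drop a
// downloaded (or locally found) block into the right place in a restored file.
static class BlockMath
{
    public const long Blocksize = 1024 * 1024;   // the 1MB default proposed here

    public static long OffsetOfIndex(long index) => index * Blocksize;
    public static long IndexOfOffset(long offset) => offset / Blocksize;
}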

It will mean you lose most of the deduplication benefit during the changeover period.

How do you see two blocksize settings coexisting in the same backup, and for how long? What finally ends the dual setup?
Deduplication loss kicks in on any file changes, I think. It must be backed up again as a completely new set of blocks, right?
Do you think it possible for a file to use different sizes throughout, or is it just that some files use the old size while others use the new?
The first leads to variable block sizes, which some backup programs can do, but not Duplicati today (or so I believe). The latter seems easier...

I'd note that this is getting far more ambitious, with some sort of mixed-blocksize plan, than what I was originally worrying about.
The original worry was mostly about tripping some protection against blocksize changes. Actually allowing them seems like a huge testing effort.

@Jojo-1000
Contributor

I tested the default change and it works as I would expect.

Existing backups with default 100kB

  • Keep their existing blocksize, also after database recreate
  • Setting blocksize manually to the new default produces an error
  • Exporting the configuration and importing works with existing backend files after database recreate
  • Exporting the configuration and importing without existing backend files uses the new default blocksize

Existing backups with explicit blocksize set

  • Work as before
  • Exporting and importing keeps blocksize the same

New backups with default blocksize

  • Use blocksize of 1MB
  • Exporting and importing in the master version changes the blocksize back to 100kB when there are no destination files

From my point of view, this can be merged for a canary build. I was also thinking about putting the blocksize option under the volume size, instead of hiding it in the advanced options, to make the user more aware that it should be chosen carefully (and maybe disable the field for existing backups).

@ts678
Collaborator

ts678 commented Jun 21, 2023

Thanks for the testing! The good news is that apparently it respects the blocksize in an existing backup unless an explicit one is given (which probably means we should be careful when somebody asks what blocksize their backup has, if it's not explicitly in the options).

I don't know the code (but I'm tempted to go looking), but I see blocksize in the Configuration table, and it's also in the manifest file, which I think is in every destination file. That file is probably how a DB recreate figures out what blocksize the backup is using.

EDIT:

// The volume reader picks the blocksize up from the manifest inside a destination
// file when the option was not given explicitly:
public static void UpdateOptionsFromManifest(string compressor, string file, Options options)
{
    using (var stream = new System.IO.FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var c = LoadCompressor(compressor, stream, options))
    using (var s = c.OpenRead(MANIFEST_FILENAME))
    using (var fs = new StreamReader(s, ENCODING))
    {
        var d = JsonConvert.DeserializeObject<ManifestData>(fs.ReadToEnd());
        if (d.Version > ManifestData.VERSION)
            throw new InvalidManifestException("Version", d.Version.ToString(), ManifestData.VERSION.ToString());

        string n;
        if (!options.RawOptions.TryGetValue("blocksize", out n) || string.IsNullOrEmpty(n))
            options.RawOptions["blocksize"] = d.Blocksize + "b";

// A caller (apparently the database recreate path, given autoDetectBlockSize) uses
// the above and then recomputes its cached sizes:
if (!hasUpdatedOptions && (!updating || autoDetectBlockSize))
{
    VolumeReaderBase.UpdateOptionsFromManifest(parsed.CompressionModule, tmpfile, m_options);
    hasUpdatedOptions = true;
    // Recompute the cached sizes
    blocksize = m_options.Blocksize;
    hashes_pr_block = blocksize / m_options.BlockhashSize;
}

// When the job database is available, the blocksize comes from its stored options
// (the Configuration table) instead:
internal static void UpdateOptionsFromDb(LocalDatabase db, Options options, System.Data.IDbTransaction transaction = null)
{
    string n = null;
    var opts = db.GetDbOptions(transaction);
    if (opts.ContainsKey("blocksize") && (!options.RawOptions.TryGetValue("blocksize", out n) || string.IsNullOrEmpty(n)))
        options.RawOptions["blocksize"] = opts["blocksize"] + "b";

// The operation handlers then fill options from the database and verify parameters:
// Check the database integrity
Utility.UpdateOptionsFromDb(m_database, m_options);
Utility.VerifyParameters(m_database, m_options);

@ts678
Collaborator

ts678 commented Jun 21, 2023

VerifyParameters is where the error on blocksize would have been raised. It looks like it will only compare explicit options.
Ones that are not explicit are left the way they were after being read from the database Configuration (discussed above).
That seems good, and seeing UpdateOptionsFromDb in lots of places in the code makes me feel good about getting that right.
Everything looks like it adjusts according to the Configuration table, so getting that right should make everything follow.

Blocksize is used to create byte buffers in the splitting code. So smaller will cause overflows.

https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Operation/Backup/StreamBlockSplitter.cs
is possibly what this refers to, but "overflows" sounds like a buffer overflow, and those buffers are sized by blocksize, so I guess
I'm not sure exactly what is going to overflow and why. It could be tested, but it could also be a false alarm.

On the subject of testing, I think this PR came in while unit tests were broken due to a bad certificate. How to test again?
I suspect that the manual tests did more than unit tests do, but might as well run whatever the standard tests can test...

Other than that, I think this looks Canary-ready, but further before the next Beta might still be a better Canary to target.

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/restoring-a-large-db/17189/11

@kenkendk
Member

I am in favor of this change, and can perhaps add some context.

The reason why you cannot change the block size on existing backups is that the hash is calculated from the block.
If you change the block size, (almost) all hashes will change and there will be no matches.
It is not a problem as such, but it would be unexpected that all data is suddenly stored twice.
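
To illustrate (a minimal sketch, not Duplicati's actual hashing pipeline): block hashes are computed over fixed-size chunks, so re-chunking the same data with a different blocksize shifts every block boundary and produces almost entirely different hashes, and therefore no deduplication matches.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Sketch only: hash a file as fixed-size blocks.
static List<string> HashBlocks(string path, int blocksize)
{
    var hashes = new List<string>();
    using var sha = SHA256.Create();
    using var fs = File.OpenRead(path);
    var buffer = new byte[blocksize];
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        hashes.Add(Convert.ToBase64String(sha.ComputeHash(buffer, 0, read)));   // last block may be short
    return hashes;
}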

Besides this, the restore and other logic uses the block size to calculate where things should go, and this will not work if there are two different block sizes. This can be fixed, but it requires quite a lot of testing and additional logic. I am not convinced this is something that is highly needed.

Because of the issues with changing block size, the size being used will be recorded in every remote file and checked before the file can be used. If you do not supply a block size override, it will simply read the value from either the database or the remote files and continue with this size. This is "by design" and I expect this to work with the current codebase.

If you are working with 1+ TiB backups, you should certainly use a larger block size (and awesome measurements, btw!). The downside of increasing the block size is less "reuse" in changed files, but this only impacts a few applications, such as databases.

For most "large" files, the content is compressed, meaning that even small edits will result in a completely new file (and least when looking at fixed-offset blocks). This holds for images, zip files, video files, Office documents, etc.

For files that are appended to (such as logs, diaries, etc) this would have an impact on the last block being included slightly more often than it would before. These files are expected to be smaller, and things like logs compress really well anyway.

For applications such as databases, there will usually be a page structure where fragments of the database file are updated. With bigger blocks, a change will be detected more often, and especially pages that span two blocks will cause some jitter.
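
A small illustration of that effect (again just a sketch, building on the HashBlocks idea above): comparing the block hash lists of two versions of a file shows which fixed-offset blocks need re-uploading. An append dirties only the (previously short) last block plus the newly added blocks, while an in-place page update that straddles a boundary dirties two blocks.

using System;
using System.Collections.Generic;

// Sketch only: count how many fixed-offset blocks differ between two versions.
static int CountChangedBlocks(IReadOnlyList<string> oldHashes, IReadOnlyList<string> newHashes)
{
    int shared = Math.Min(oldHashes.Count, newHashes.Count);
    int changed = 0;
    for (int i = 0; i < shared; i++)
        if (oldHashes[i] != newHashes[i])
            changed++;
    // Blocks present only in the newer version are new uploads as well.
    return changed + Math.Abs(newHashes.Count - oldHashes.Count);
}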

I think administrators who make backups of databases are more likely to understand the tradeoffs, and I see very little impact for everyone else.

@kenkendk
Member

The failing tests are because the test code assumes a different block size (I think).
The doc update is easy enough.

As for getting faster db support, a quick-fix could be to use something like DuckDB.
But a real fix is IMO to switch to a key-value store for the local data, rather than rely on a relational database.

@ts678
Collaborator

ts678 commented Mar 18, 2024

Maybe some of this was to my comments on the issue rather than the PR (which would probably have been a better place).

I'm glad you know DuckDB; however, IIRC its speed improvements are aimed at specific use cases that may not fit Duplicati.
Duplicati sometimes has extremely long-running queries, but sometimes extremely repetitious short ones, e.g. one per block.
One database performance person referred to Duplicati as chatty. DuckDB is generally favored for OLAP, SQLite for OLTP.

https://forum.duplicati.com/t/is-there-an-er-diagram-and-data-dictionary/14204/5?u=ts678

I had a look at the C# code last night and it’s going the way that all projects go in this position… everyone is writing their own query, hitting the database for every little piece of information. The result is a very chatty database interaction which kills performance and litters the code with snippets of SQL. I suggest this be reviewed as we understand how the database is to be used.

What's under the database may matter too, e.g. how badly does a hard drive slow things down when an SSD wasn't used?
These are great long-term questions, and they might limit interesting SQLite usage such as multiple database connections.
We've been playing with an increased SQLite cache, and there's even an undocumented environment variable for adding pragmas:

// set custom Sqlite options
var opts = Environment.GetEnvironmentVariable("CUSTOMSQLITEOPTIONS_DUPLICATI");
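
For illustration only: the exact value format this variable accepts isn't documented here, so the pragmas and separator below are assumptions, not a recommendation.

// Assumed usage sketch; the option syntax is a guess for illustration.
Environment.SetEnvironmentVariable(
    "CUSTOMSQLITEOPTIONS_DUPLICATI",
    "cache_size=-200000;temp_store=MEMORY");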

Still, the near-term aid probably comes from bigger blocksize, or at least that's about all I have to offer people on the forum.

@kenkendk kenkendk merged commit 7dfa19d into duplicati:master Apr 21, 2024
0 of 10 checks passed
@mr-russ mr-russ deleted the blockdefault branch April 22, 2024 06:30