Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

casync decompresses x1.5 faster than borg on same config (and other benchmarks) #7674

Open
safinaskar opened this issue Jun 24, 2023 · 32 comments

Comments

@safinaskar
Copy link

safinaskar commented Jun 24, 2023

UPD 2023-06-26 14:30 UTC: this is useless benchmark, see #7674 (comment) . I did proper benchmark here: #7674 (comment) .

Hi. I did many speed comparisons. Here are results:

  • casync ( https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html ) (which is inspired by many projects, including borg) decompresses particular realistic data (12 snapshots of 10 GiB VM) x1.5 faster than borg 2.0.0b5 on same chunker params and same compression level (compression ratio is same). But compression speed for borg is 2.3x better
  • borg --chunker-params fixed has x2.5 slower compression speed than my own trivial single-threaded Rust program for data deduplication on same settings/algorithms (compression ratio is same). Decompression for borg is x4.5 slower
  • borg can achieve nearly same compression ratio as zpaq's default settings, while being x2.9 faster in compression. Decompression speed is same. (Algorithms are different)
  • casync can achieve nearly same compression ratio as zpaq's default settings, while being x1.6 faster in decompression. Compression speed is same. (Algorithms are different)

"Compression" above means "all steps needed for putting given file (VM image) to deduplicator's storage". "Decompression" means opposite action, i. e. extracting from such storage. And everywhere I say borg I mean borg2.

Now let me give details. I'm writing "something like docker, but for VMs" for Linux. And I need deduplicating storage for this, i. e. something like what borg and casync do. So I did benchmark of various deduplicating solutions (and wrote my own trivial single-threaded Rust program). Test data is sequence of 12 snapshots of same VM. The zeroth snapshot (numbered 00) is VM created after fresh Debian install. Snapshot 01 is the same machine after execution of command echo deb http://deb.debian.org/debian sid main > /etc/apt/sources.list. Snapshot 02 is snapshot 01 after command apt-get update && apt-get install -y debootstrap && debootstrap sid /debian http://deb.debian.org/debian && : > /OLD && : > /debian/NEW && echo systemctl switch-root /debian /bin/bash > /root/.bash_history && apt-get install -y systemd, etc.

Of course, all this snapshots are very similar, and there are a lot of redundancy here.

Test was performed so: I created fresh deduplicating storage, for example, by borg2 rcreate .... Then I stored snapshot 00 to it using command like borg2 create .... Then 01, etc. Then I extracted them back one by one. I recorded time of each storing and extracting. And size of resulting storage.

The programs under test are these:

  • No deduplicating at all ("control group" of this experiment). Just compress each snapshot separately using zstd -4 -T0 (this command uses all available cores)
  • borg2 2.0.0b5. Both --chunker-params fixed and --chunker-params buzhash was tested. borg is single-threaded, according to docs
  • casync 2+20201210-1 (version as reported by debian). casync doesn't consider index files .caibx as part of its storage. I. e. raw chunks only counted from casync's point of view. But (for fair comparison with borg) I counted them as part of storage. casync uses same chunking algorithm as borg2, i. e. buzhash. As it seems from docs, casync is single-threaded
  • zpaq 7.15+repack-1 (version as reported by debian) on default settings. zpaq stores everything to one file. Thus deleting arbitrary old data is not supported. Still, deduplication is supported. zpaq uses all available cores
  • My own trivial deduplication solution written in Rust. The program has 125 lines (but heavily uses libraries from crates.io ). It simply cuts file in same-size chunks (i. e. it is analog of borg2 --chunker-params fixed), then deduplicates them and compresses using zstd. Similarly to casync the program manages chunks storage, but doesn't manage index files (i. e. user should manage index files somehow on its own, similarly to casync). But for fair comparison with other solutions index file sizes was counted as part of storage. The program is absolutely trivial and does bare minimum. Also it cheats: it compares chunks using weak murmur3 hash instead of some cryptographic hash (this is possible reason for good speed). But this should not affect decompression speed, because there is no reason to compute hashes when decompressing. The program is single-threaded. Unfortunately, there is no analog of casync's casync gc (garbage collector) command, so proper deletion of data is not supported, i. e. chunk storage is effectively append-only (of course, this feature can be added if needed)

The systems for testing was these:

  • DigitalOcean (DO) VPS, 2 cores. Linux. Debian bookworm. (I don't remember which file system I used for testing)
  • AWS VPS, 96 cores. tmpfs. Linux. Debian bullseye

The same versions of software used for both tests (expect for zstd, i. e. "control group", I'm sorry about that). All software was installed from debian repos (except for my solution, of course).

Okay, so here are some particular facts from this experiment.

Let's compare casync's default chunker settings with exact same borg's chunker settings. Here is result (DO):

{"method":"Casync { chunker_params: \"16384:65536:262144\" }","size":2288245859,"compression":[25018278,14123602,20728147,75336765,74989789,71274134,75145956,73516900,73657489,71545543,72030643,71385677],"decompression":[12319323,9242255,11615280,24205058,27393626,27160242,30992227,32594951,32978333,33110100,32684700,32696767],"total_compression":718752927,"total_decompression":306992867}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\" }","size":2149378514,"compression":[17679646,8448045,14671946,21504361,23543654,20046573,25479210,24576887,25604774,23566982,24462718,22118519],"decompression":[12550832,12346506,13107262,49148793,48386676,45512108,49184202,50710626,53943566,54338980,56295807,54046982],"total_compression":251703322,"total_decompression":499572346}

As well as I understand from casync's sources, casync uses 48 as its window size, so I set same window size for borg. So, everything is same, chunker is same, compression is same (level 3 is default for casync make --compression=zstd). So, we (predictably) get nearly same repo size. But compression speed is way bigger for borg. And decompression speed is way bigger for casync.

Now let's compare borg2 --chunker-params=fixed with my dedupper with same chunk size and same compression (AWS):

{"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,1\" }","size":3045630161,"compression":[10498009,5765099,11638767,11808097,11682813,9746850,14360024,12748675,13860854,12773828,13976675,13142882],"decompression":[8436011,8436552,8773607,34001769,34560878,34688943,35455363,35270733,35649059,35656812,35832813,35752490],"total_compression":142002579,"total_decompression":342515034}
{"method":"MyChunker { block_bytes: 4194304, zstd: 1, hash: \"murmur3\" }","size":3045403863,"compression":[5198602,836222,5851959,7503258,5633901,3456217,6540237,4397998,4985617,3890258,4718666,3864603],"decompression":[2717125,2661899,3047367,6441092,6868591,6854122,7567294,7639675,7984931,7982271,8272383,8156871],"total_compression":56877544,"total_decompression":76193627}
{"method":"MyChunker { block_bytes: 4194304, zstd: 1, hash: \"sha256\" }","size":3045797079,"compression":[10820321,6458880,11505570,35665701,33806750,31593533,34644788,32626207,33124389,32051083,32847354,32049456],"decompression":[2677887,2655985,3093219,6251224,6857964,6813882,7581230,7604525,7907221,7911569,8100640,8103418],"total_compression":327194037,"total_decompression":75558768}

So, everything is same. And predictably we got nearly same compressed size. But my dedupper is way faster than borg both in compression and decompression (if we cheat using murmur3 for chunk comparision). Compression is x2.5 faster. And decompression is x4.5 faster. I don't compute hashes during decompression (I think there is no reason to do so), so even in sha256 mode decompression is still x4.5 faster than borg. But with sha256 compression becomes x2.3 slower than borg (it seems I chose slow sha256 implementation from crates.io or maybe the reason is overflow-checks = true).

Both casync and borg can achieve compression level similar to zpaq's default settings (if properly configured). But in such case casync and borg become better in compression speed or in decompression speed (DO):

{"method":"Zpaq","size":2020550898,"compression":[66510942,13447238,47685405,80578869,80038181,59823872,80654039,65066410,71060346,61404565,64172466,61875377],"decompression":[18750110,20467299,29366756,36280852,41599404,41990291,48276667,51680639,54273853,54363762,55445757,55403322],"total_compression":752317716,"total_decompression":507898716}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,4\" }","size":2145319615,"compression":[20378384,8886346,16710992,23303757,24969724,20268377,26531417,25136239,26505187,24139859,25912861,22098249],"decompression":[12205145,12290434,12885718,43323609,44173501,43728450,44725463,46298867,46943885,46165601,48665056,48574784],"total_compression":264841397,"total_decompression":449980518}
{"method":"Casync { chunker_params: \"16384:65536:262144\" }","size":2288245859,"compression":[25018278,14123602,20728147,75336765,74989789,71274134,75145956,73516900,73657489,71545543,72030643,71385677],"decompression":[12319323,9242255,11615280,24205058,27393626,27160242,30992227,32594951,32978333,33110100,32684700,32696767],"total_compression":718752927,"total_decompression":306992867}

Conclusions:

  • casync compresses slower than borg for same (buzhash) parameters. So, something is going wrong and casync should be fixed
  • borg decompresses slower than borg for same (buzhash) parameters. So, again, something is going wrong and borg should be fixed
  • My dedupper decompresses 4.5 times faster than borg. So, again, something is really wrong with borg decompression
  • zpaq's default settings are really bad for my use case, and casync and borg can easily beat them

If I had blog, I would post everything there. But I don't have a blog, so I post everything here, to borg bug tracker. I hope this report will be useful to borg devs. If you want to write me, but don't want to pollute borg bug tracker, just write me directly to safinaskar@gmail.com .

Okay, now let me show you code and raw data.

Code for my dedupper:

Cargo.toml
[package]
name = "my-chunker"
version = "0.1.0"
edition = "2021"

[profile.release]
overflow-checks = true

[dependencies]
fastmurmur3 = "0.2.0"
hex = "0.4.3"
libc = "0.2.146"
ring = "0.16.20"
zstd = "0.12.3"
src/main.rs
#![feature(fs_try_exists)]
#![feature(file_create_new)]

use std::io::Read;
use std::io::Write;

struct Killed;

// "kill" just means "completely drop this value". I. e. we drop it and then shadow it to make sure we will not be able to access it even if it is Copy. I. e. when you see "kill!", assume it is drop
macro_rules! kill {
    ($x:ident) => {
        #[allow(dropping_references)]
        #[allow(dropping_copy_types)]
        std::mem::drop($x);
        #[allow(unused_variables)]
        let $x = Killed;
    }
}

#[derive(Clone, Copy)]
enum Hash {
    MurMur3,
    Sha256,
}

fn calc_hash(hash_type: Hash, data: &[u8]) -> Vec<u8> {
    return match hash_type {
        Hash::MurMur3 => fastmurmur3::hash(data).to_le_bytes().to_vec(),
        Hash::Sha256 => ring::digest::digest(&ring::digest::SHA256, data).as_ref().to_vec(),
    };
}

// "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()"
fn main() {
    if std::env::args().collect::<Vec<_>>()[1] == "add" {
        let [_, _, ref o_storage, ref o_block_bytes, ref o_zstd, ref o_hash] = *std::env::args().collect::<Vec<_>>() else { panic!("Usage"); };
        let block_bytes: usize = o_block_bytes.strip_prefix("--block-bytes=").unwrap().parse().unwrap();
        kill!(o_block_bytes);
        if !(4096 <= block_bytes && block_bytes <= 10_000_000_000) {
            panic!();
        }
        let zstd_level: i32 = o_zstd.strip_prefix("--zstd=").unwrap().parse().unwrap();
        kill!(o_zstd);
        let hash_type = match &**o_hash {
            "--strong-hash=murmur3" => Hash::MurMur3,
            "--strong-hash=sha256" => Hash::Sha256,
            _ => panic!(),
        };
        kill!(o_hash);
        let storage = o_storage.strip_prefix("--chunks=").unwrap();
        kill!(o_storage);

        let mut input = std::io::stdin().lock();
        let mut stdout = std::io::stdout().lock();
        let mut buf = vec![0u8; block_bytes];
        kill!(block_bytes);
        let mut hashs = vec![];
        loop {
            let buf_size = input.read(&mut buf).unwrap();
            if buf_size == 0 {
                break;
            }
            let buf = &buf[..buf_size];
            kill!(buf_size);
            let hash = calc_hash(hash_type, &buf);
            hashs.extend_from_slice(&hash);
            let hash_str = hex::encode(&hash);
            kill!(hash);
            if !std::fs::try_exists(format!("{storage}/{hash_str}")).unwrap() {
                let compressed = zstd::bulk::compress(buf, zstd_level).unwrap();
                let mut file = std::fs::File::create_new(format!("{storage}/{hash_str}")).unwrap();
                [()] = [file.write_all(&u64::try_from(buf.len()).unwrap().to_le_bytes()).unwrap()];
                kill!(buf);
                [()] = [file.write_all(&compressed).unwrap()];
            }
        }
        [()] = [stdout.write_all(&hashs).unwrap()];
    } else if std::env::args().collect::<Vec<_>>()[1] == "extract" {
        let [_, _, ref o_storage, ref o_hash] = *std::env::args().collect::<Vec<_>>() else { panic!("Usage"); };
        let hash_type = match &**o_hash {
            "--strong-hash=murmur3" => Hash::MurMur3,
            "--strong-hash=sha256" => Hash::Sha256,
            _ => panic!(),
        };
        kill!(o_hash);
        let storage = o_storage.strip_prefix("--chunks=").unwrap();
        kill!(o_storage);
        let mut input = std::io::stdin().lock();
        let mut output = std::io::stdout().lock();
        loop {
            let mut hash_buf = vec![0u8; match hash_type {
                Hash::MurMur3 => 128 / 8,
                Hash::Sha256 => 256 / 8,
            }];
            {
                let have_read = input.read(&mut hash_buf).unwrap();
                if have_read == 0 {
                    break;
                } else if have_read == hash_buf.len() {
                    // OK
                } else {
                    panic!();
                }
            }
            let hash_str = hex::encode(&hash_buf);
            let mut file = std::fs::File::open(format!("{storage}/{hash_str}")).unwrap();
            let mut size = [0u8; 8];
            [()] = [file.read_exact(&mut size).unwrap()];
            let size = usize::try_from(u64::from_le_bytes(size)).unwrap();
            let mut data = vec![];
            let _: usize = file.read_to_end(&mut data).unwrap();
            kill!(file);
            {
                let decompressed = zstd::bulk::decompress(&data, size).unwrap();
                kill!(data);
                assert_eq!(decompressed.len(), size);
                kill!(size);
                [()] = [output.write_all(&decompressed).unwrap()];
            }
        }
    } else {
        panic!("Usage");
    }
}

Now program for executing benchmark

Cargo.toml
[package]
name = "rust"
version = "0.1.0"
edition = "2021"

[dependencies]
regex = "1.8.4"
serde_json = { version = "1.0.79", features = ["preserve_order"] }
serde = { version = "1.0.136", features = ["derive"] }
src/main.rs
#![feature(exit_status_error)]

// "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()"

fn run(s: &str) {
    [()] = [std::process::Command::new("bash").arg("-ec").arg(s).status().unwrap().exit_ok().unwrap()];
}

fn main() {
    const REPO: &str = "/home/user/dedup-bench/fs/repo";
    const CMP_ME: &str = "/home/user/dedup-bench/fs/cmp-me";
    #[derive(Debug)]
    enum Method {
        ZStd,
        Borg { chunker_params: String, compression: String, },
        Casync { chunker_params: String, },
        Zpaq,
        MyChunker { block_bytes: usize, zstd: i32, hash: String },
    }
    let range = 0..=11;
    // let range = 0..=2;
    // let range = 0..=5;
    run(&format!("rm -rf {REPO}"));
    run(&format!("rm -rf {CMP_ME}"));
    #[derive(Debug)]
    struct Run {
        method: Method,
        size: usize,
        compression: Vec<std::time::Duration>,
        decompression: Vec<std::time::Duration>,
        total_compression: std::time::Duration,
        total_decompression: std::time::Duration,
    }
    let mut runs = vec![];
    let mut methods = vec![];
    {
        methods.push(Method::ZStd);
        for chunker_params in [
            "buzhash,19,23,21,4095", // borg default
            "buzhash,10,23,16,4095", // borg alternative
            "buzhash,14,18,16,4095", // casync default
            "buzhash,19,23,21,48", // borg default (casync's window size)
            "buzhash,10,23,16,48", // borg alternative
            "buzhash,14,18,16,48", // casync default
            "fixed,4096,512",
            "fixed,4096",
            "fixed,4194304,512",
            "fixed,4194304",
        ] {
            for compression in [
                "none",
                "lz4",
                "zstd,1",
                "zstd,2",
                "zstd,3", // casync default
                "zstd,4",
            ] {
                methods.push(Method::Borg { chunker_params: chunker_params.to_owned(), compression: compression.to_owned() });
            }
        }
        for chunker_params in [
            format!("{}:{}:{}", 1 << 19, 1 << 21, 1 << 23), // borg default
            format!("{}:{}:{}", 1 << 10, 1 << 16, 1 << 23), // borg alternative
            format!("{}:{}:{}", 1 << 14, 1 << 16, 1 << 18), // casync default
        ] {
            methods.push(Method::Casync { chunker_params: chunker_params.to_owned(), });
        }
        methods.push(Method::Zpaq);
        for block_bytes in [1 << 15, 1 << 16, 1 << 17, 1 << 18, 1 << 19, 1 << 20, 1 << 21, 1 << 22 /*4194304*/, 1 << 23, 1 << 24, 1 << 25, 1 << 26, 1 << 27, 1 << 28, 1 << 29, 1 << 30] {
            //for zstd in [-22, -20, -10, -5, -1, 1, 2, 3, 4, 5, 6] {
            for zstd in [1] {
                //for hash in ["murmur3", "sha256"] {
                for hash in ["murmur3"] {
                    methods.push(Method::MyChunker { block_bytes, zstd, hash: hash.to_owned() });
                }
            }
        }
    }
    for method in methods {
        run(&format!("mkdir {REPO}"));
        run(&format!("mkdir {CMP_ME}"));

        // Initializing
        match method {
            Method::ZStd => {},
            Method::Borg { .. } => run(&format!("borg2 rcreate --encryption=none --repo {REPO}")),
            Method::Casync { .. } => run(&format!("mkdir {REPO}/index {REPO}/storage.castr")),
            Method::Zpaq => {},
            Method::MyChunker { .. } => run(&format!("mkdir {REPO}/chunks {REPO}/index")),
        }

        let mut compression = vec![];

        for i in range.clone() {
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            println!("**** {method:?} {i} compression...");
            let now = std::time::Instant::now();
            match &method {
                Method::ZStd => run(&format!("zstd -4 -T0 < /home/user/dedup-bench/sto/{i:02} > {REPO}/{i:02}.zst")),
                Method::Borg { chunker_params, compression } => run(&format!("cd /home/user/dedup-bench/sto; borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}")),
                Method::Casync { chunker_params } => run(&format!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02} > /dev/null")),
                Method::Zpaq => run(&format!("cd /home/user/dedup-bench/sto; zpaq add {REPO}/repo.zpaq {i:02}")),
                Method::MyChunker { block_bytes, zstd, hash } => run(&format!("/my-chunker add --chunks={REPO}/chunks --block-bytes={block_bytes} --zstd={zstd} --strong-hash={hash} < /home/user/dedup-bench/sto/{i:02} > {REPO}/index/{i:02}")),
            }
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            compression.push(now.elapsed());
            println!("**** {method:?} {i} compression: {:?}", now.elapsed());
        }

        let output = std::process::Command::new("bash").arg("-ec").arg(format!("du --bytes --apparent-size -s {REPO}")).output().unwrap();
        [()] = [output.status.exit_ok().unwrap()];
        assert_eq!(output.stderr.len(), 0);
        let size: usize;
        {
            let x = String::from_utf8(output.stdout).unwrap();
            let caps = regex::Regex::new(r##"^(?<s>[0-9]+)\t/[-/a-z]+\n$"##).unwrap().captures_iter(&x).collect::<Vec<_>>();
            assert!(caps.len() == 1);
            size = caps[0]["s"].parse().unwrap();
        }
        {
            let len = size
                .to_string()
                .chars()
                .collect::<Vec<_>>()
                .rchunks(3)
                .map(|x|x.into_iter().collect::<String>())
                .rev()
                .collect::<Vec<_>>()
                .join("_");
            println!("size: {len}");
        }

        let mut decompression = vec![];
        for i in range.clone() {
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            println!("**** {method:?} {i} decompression...");
            let now = std::time::Instant::now();
            match method {
                Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")),
                Method::Borg { .. } => run(&format!("cd {CMP_ME}; borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")),
                Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")),
                Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")),
                Method::MyChunker { block_bytes: _, zstd: _, ref hash } => run(&format!("/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} < {REPO}/index/{i:02} > {CMP_ME}/data")),
            }
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            decompression.push(now.elapsed());
            println!("**** {method:?} {i} decompression: {:?}", now.elapsed());
            run(&format!("cmp /home/user/dedup-bench/sto/{i:02} {CMP_ME}/data"));
            run(&format!("rm {CMP_ME}/data"));
        }

        run(&format!("rm -r {REPO}"));
        run(&format!("rm -r {CMP_ME}"));

        let total_compression   = compression.iter().sum();
        let total_decompression = decompression.iter().sum();
        runs.push(Run { method, size, compression, decompression, total_compression, total_decompression });
    }
    for i in &runs {
        println!("{:?}", i);
    }
    #[derive(Debug, serde::Serialize)]
    struct JsonRun {
        method: String,
        size: usize,
        compression: Vec<u128>,
        decompression: Vec<u128>,
        total_compression: u128,
        total_decompression: u128,
    }
    for i in runs {
        let Run { method, size, compression, decompression, total_compression, total_decompression } = i;
        println!("{}", serde_json::to_string(&JsonRun {
            method: format!("{:?}", method),
            size,
            compression: compression.into_iter().map(|x|x.as_micros()).collect(),
            decompression: decompression.into_iter().map(|x|x.as_micros()).collect(),
            total_compression: total_compression.as_micros(),
            total_decompression: total_decompression.as_micros(),
        }).unwrap());
    }
}

Now snapshots. All snapshots are raw GPT disk images with single ext4 partition. Snapshot 00 is 2 GiB fresh install of debian sid from https://snapshot.debian.org/archive/debian/20220917T224346Z . Next snapshots are produced by executing the following sequence of commands:

01 echo deb http://deb.debian.org/debian sid main > /etc/apt/sources.list
02 apt-get update && apt-get install -y debootstrap && debootstrap sid /debian http://deb.debian.org/debian && : > /OLD && : > /debian/NEW && echo systemctl switch-root /debian /bin/bash > /root/.bash_history && apt-get install -y systemd
03 (resize to 10G)
04 apt-get install -y git build-essential
05 echo deb-src http://deb.debian.org/debian sid main >> /etc/apt/sources.list
06 apt-get update && apt-get build-dep -y systemd
07 git clone https://github.com/systemd/systemd
08 cd systemd && git checkout f717d7a40a696b351415976f22a4f498c401de41 && meson setup build -Dmode=release && ninja -C build
09 :> /var/lib/dpkg/info/systemd.prerm
10 apt-get purge -y systemd systemd-sysv && cd systemd && make install
11 echo alive

I. e. snapshot 04 is snapshot 03 plus apt-get install -y git build-essential. These snapshots naturally appeared when I attempted to reproduce particular systemd bug, so data is 100% realistic.

The snapshot contain nothing confidential, so if you want, I can send them to you, just write me.

Now full benchmark results. (JSON is in the end.)

DO: https://paste.gg/p/anonymous/063883027e9848d8b7f53a046c4bd3cf
AWS: https://paste.gg/p/anonymous/bd6d25d5906f4c0fb94b78af605fb475

(Unfortunately, I used du for calculating storage size, but, as well as I remember, du utils from DO and AWS seem to produce slightly different results, i. e. these two du's use different algorithms)

Assume all this code as public domain. If you want me to release any of my mentioned software as proper free software project (say, to crates.io), just write me.

Exact commands for invoking borg and other programs under test:

        // Initializing
        match method {
            Method::ZStd => {},
            Method::Borg { .. } => run(&format!("borg2 rcreate --encryption=none --repo {REPO}")),
            Method::Casync { .. } => run(&format!("mkdir {REPO}/index {REPO}/storage.castr")),
            Method::Zpaq => {},
            Method::MyChunker { .. } => run(&format!("mkdir {REPO}/chunks {REPO}/index")),
        }
...
        match &method {
                Method::ZStd => run(&format!("zstd -4 -T0 < /home/user/dedup-bench/sto/{i:02} > {REPO}/{i:02}.zst")),
                Method::Borg { chunker_params, compression } => run(&format!("cd /home/user/dedup-bench/sto; borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}")),
                Method::Casync { chunker_params } => run(&format!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02} > /dev/null")),
                Method::Zpaq => run(&format!("cd /home/user/dedup-bench/sto; zpaq add {REPO}/repo.zpaq {i:02}")),
                Method::MyChunker { block_bytes, zstd, hash } => run(&format!("/my-chunker add --chunks={REPO}/chunks --block-bytes={block_bytes} --zstd={zstd} --strong-hash={hash} < /home/user/dedup-bench/sto/{i:02} > {REPO}/index/{i:02}")),
            }
...
            match method {
                Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")),
                Method::Borg { .. } => run(&format!("cd {CMP_ME}; borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")),
                Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")),
                Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")),
                Method::MyChunker { block_bytes: _, zstd: _, ref hash } => run(&format!("/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} < {REPO}/index/{i:02} > {CMP_ME}/data")),
            }

Is this a BUG / ISSUE report or a QUESTION?

I think borg is slower than needed. Possible bug

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 24, 2023

Thanks for the information and doing so much testing.

I see you used misc. VPSes - how can you know they provide consistent CPU and I/O performance, so your benchmark timings are comparable? Even on bare hardware, timings might fluctuate quite a bit due to other stuff the machine is doing in parallel.

Also: did these VPSes offer sha2 hw acceleration? This is quite importing when using sha256, which is otherwise quite slow. Without sha2 hw accel, blake2 is better (in case of borg).

You need to use a strong cryptographic hash - there is no point in using a fast weak hash if you have a higher risk of hash collisions.

At "extraction" time, one needs to assert chunkid_we_wanted == chunkid_hash(chunk_content_we_got) or a repo side attacker might give you a wrong chunk and not the one you wanted. for encrypted repos "hash" actually means a MAC here, computed with a secret key that a repo side attacker can't know. I initially thought I could get rid of that check by using AEAD crypto, but that's not the case.

Maybe for comparisons of misc tools, it would be better to just focus on one compression algorithm and level and have less result data to review. Of course that should be a fast one as we are not testing for compression performance, but rather for tool performance, so I'ld suggest lz4 or zstd,1.

@safinaskar
Copy link
Author

I did some more testing. And now I see that AWS's performance is completely unstable. Just now I added "00" to empty borg repo with settings chunker_params: "fixed,4194304", compression: "zstd,3" and this took 12.687978481s. Then (after 5 minutes) I did the same again, again to empty repo, and this took 23.077241485s. So, I'm sorry for bad benchmark. I probably will try to find more stable host and update results.

( @ThomasWaldmann, I see your questions, I will answer them later)

@safinaskar
Copy link
Author

I did better benchmark. Here is list of changes (see the first post #7674 (comment) for context):

  • Now I did everything on my laptop Librem Purism. No other programs was running on the laptop in parallel. Results seem stable. (But I still see some little noize. And yes, I'm not benchmark expert, I just want to share data I created for choosing dedupper solution for my purposes.) All tests done in recent Debian Linux sid running in docker. Source VM snapshots are on ext3. All output data (repos and results of extracting) are on tmpfs
  • I replaced sha256 implementation in my dedupper (but this new implementation is still slow)
  • I added blake2b and blake3 to my dedupper
  • I added optional possibility for verifying hashes in decompression in my dedupper (for fair comparison with borg)
  • Now everything runs in single thread (for fair comparison)
  • All programs (except for zpaq) now use zstd compression with level 3, as suggested by @ThomasWaldmann (I chose level 3, because it is hardcoded in casync)
  • Now I test 3 different arguments for --encryption parameter in borg: "none", "authenticated", "authenticated-blake2"

My dedupper:

Cargo.toml
[package]
name = "my-chunker"
version = "0.1.0"
edition = "2021"

[dependencies]
blake2b_simd = "1.0.1"
blake3 = "1.4.0"
fastmurmur3 = "0.2.0"
hex = "0.4.3"
libc = "0.2.146"
openssl = { version = "0.10.55", features = ["vendored"] }
zstd = "0.12.3"
src/main.rs
#![feature(fs_try_exists)]
#![feature(file_create_new)]

use std::io::Read;
use std::io::Write;

struct Killed;

// "kill" just means "completely drop this value". I. e. we drop it and then shadow it to make sure we will not be able to access it even if it is Copy. I. e. when you see "kill!", assume it is "drop"
macro_rules! kill {
    ($x:ident) => {
        #[allow(dropping_references)]
        #[allow(dropping_copy_types)]
        std::mem::drop($x);
        #[allow(unused_variables)]
        let $x = Killed;
    }
}

#[derive(Clone, Copy)]
enum Hash {
    MurMur3,
    Sha256,
    Blake3,
    Blake2b,
}

fn calc_hash(hash_type: Hash, data: &[u8]) -> Vec<u8> {
    return match hash_type {
        Hash::MurMur3 => fastmurmur3::hash(data).to_le_bytes().to_vec(),
        Hash::Sha256 => {
            openssl::sha::sha256(data).to_vec()
            /*use sha2::Digest;
            let mut hasher = sha2::Sha256::new();
            hasher.update(data);
            hasher.finalize().to_vec()*/
        },
        Hash::Blake3 => blake3::hash(data).as_bytes().to_vec(),
        Hash::Blake2b => blake2b_simd::blake2b(data).as_bytes().to_vec(),
    };
}

fn hash_name_to_enum(name: &str) -> Hash {
    return match name {
        "--strong-hash=murmur3" => Hash::MurMur3,
        "--strong-hash=sha256" => Hash::Sha256,
        "--strong-hash=blake3" => Hash::Blake3,
        "--strong-hash=blake2b" => Hash::Blake2b,
        _ => panic!(),
    };
}

// "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()"

fn main() {
    if std::env::args().collect::<Vec<_>>()[1] == "add" {
        let [_, _, ref o_storage, ref o_block_bytes, ref o_zstd, ref o_hash] = *std::env::args().collect::<Vec<_>>() else { panic!("Usage"); };
        let block_bytes: usize = o_block_bytes.strip_prefix("--block-bytes=").unwrap().parse().unwrap();
        kill!(o_block_bytes);
        if !(4096 <= block_bytes && block_bytes <= 10_000_000_000) {
            panic!();
        }
        let zstd_level: i32 = o_zstd.strip_prefix("--zstd=").unwrap().parse().unwrap();
        kill!(o_zstd);
        let hash_type = hash_name_to_enum(o_hash);
        kill!(o_hash);
        let storage = o_storage.strip_prefix("--chunks=").unwrap();
        kill!(o_storage);

        let mut input = std::io::stdin().lock();
        let mut stdout = std::io::stdout().lock();
        let mut buf = vec![0u8; block_bytes];
        kill!(block_bytes);
        let mut hashs = vec![];
        loop {
            let buf_size = input.read(&mut buf).unwrap();
            // buf_size can be lesser than block_bytes, even if we are not in EOF. We consider this normal (in other words, it is NOT guaranted that we will actually split file as equal-sized blocks)
            if buf_size == 0 {
                break;
            }
            let buf = &buf[..buf_size];
            kill!(buf_size);
            let hash = calc_hash(hash_type, &buf);
            hashs.extend_from_slice(&hash);
            let hash_str = hex::encode(&hash);
            kill!(hash);
            if !std::fs::try_exists(format!("{storage}/{hash_str}")).unwrap() {
                let compressed = zstd::bulk::compress(buf, zstd_level).unwrap();
                let mut file = std::fs::File::create_new(format!("{storage}/{hash_str}")).unwrap();
                [()] = [file.write_all(&u64::try_from(buf.len()).unwrap().to_le_bytes()).unwrap()];
                kill!(buf);
                [()] = [file.write_all(&compressed).unwrap()];
            }
        }
        [()] = [stdout.write_all(&hashs).unwrap()];
    } else if std::env::args().collect::<Vec<_>>()[1] == "extract" {
        let [_, _, ref o_storage, ref o_hash, ref o_check] = *std::env::args().collect::<Vec<_>>() else { panic!("Usage"); };
        let hash_type = hash_name_to_enum(o_hash);
        kill!(o_hash);
        let storage = o_storage.strip_prefix("--chunks=").unwrap();
        kill!(o_storage);
        let check = match &**o_check {
            "--check=true" => true,
            "--check=false" => false,
            _ => panic!(),
        };
        let mut input = std::io::stdin().lock();
        let mut output = std::io::stdout().lock();
        loop {
            let mut hash_buf = vec![0u8; match hash_type {
                Hash::MurMur3 => 128 / 8,
                Hash::Sha256 => 256 / 8,
                Hash::Blake3 => 256 / 8,
                Hash::Blake2b => 64,
            }];
            {
                let have_read = input.read(&mut hash_buf).unwrap();
                if have_read == 0 {
                    break;
                } else if have_read == hash_buf.len() {
                    // OK
                } else {
                    panic!();
                }
            }
            let hash_str = hex::encode(&hash_buf);
            let mut file = std::fs::File::open(format!("{storage}/{hash_str}")).unwrap();
            let mut size = [0u8; 8];
            [()] = [file.read_exact(&mut size).unwrap()];
            let size = usize::try_from(u64::from_le_bytes(size)).unwrap();
            let mut data = vec![];
            let _: usize = file.read_to_end(&mut data).unwrap();
            kill!(file);
            {
                let decompressed = zstd::bulk::decompress(&data, size).unwrap();
                kill!(data);
                assert_eq!(decompressed.len(), size);
                kill!(size);
                // I think we don't need to check hash here. So the check is optional (but we check decompressed size to make sure chunk is not truncated)
                if check {
                    assert_eq!(calc_hash(hash_type, &decompressed), hash_buf);
                }
                [()] = [output.write_all(&decompressed).unwrap()];
            }
        }
    } else {
        panic!("Usage");
    }
}

Program for executing benchmarks:

Cargo.toml
[package]
name = "rust"
version = "0.1.0"
edition = "2021"

[dependencies]
regex = "1.8.4"
serde_json = { version = "1.0.79", features = ["preserve_order"] }
serde = { version = "1.0.136", features = ["derive"] }
src/main.rs
#![feature(exit_status_error)]

// "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()"

fn run(s: &str) {
    [()] = [std::process::Command::new("bash").arg("-ec").arg(s).status().unwrap().exit_ok().unwrap()];
}

fn main() {
    const REPO: &str = "/home/user/dedup-bench/fs/repo";
    const CMP_ME: &str = "/home/user/dedup-bench/fs/cmp-me";
    #[derive(Debug)]
    enum Method {
        ZStd,
        Borg { chunker_params: String, compression: String, encryption: String, },
        Casync { chunker_params: String, },
        Zpaq,
        MyChunker { block_bytes: usize, zstd: i32, hash: String, check: bool },
    }
    let range = 0..=11;
    // let range = 0..=2;
    // let range = 0..=5;
    run(&format!("rm -rf {REPO}"));
    run(&format!("rm -rf {CMP_ME}"));
    #[derive(Debug)]
    struct Run {
        method: Method,
        size: usize,
        compression: Vec<std::time::Duration>,
        decompression: Vec<std::time::Duration>,
        total_compression: std::time::Duration,
        total_decompression: std::time::Duration,
    }
    let mut methods = vec![];
    {
        methods.push(Method::ZStd);
        for chunker_params in [
            "buzhash,19,23,21,4095", // borg default
            "buzhash,10,23,16,4095", // borg alternative
            "buzhash,14,18,16,4095", // casync default
            "buzhash,19,23,21,48", // borg default (casync's window size)
            "buzhash,10,23,16,48", // borg alternative
            "buzhash,14,18,16,48", // casync default
            "fixed,4194304",
        ] {
            for compression in [
                "zstd,3", // casync default
            ] {
                for encryption in [
                    "none",
                    "authenticated",
                    "authenticated-blake2",
                ] {
                    methods.push(Method::Borg { chunker_params: chunker_params.to_owned(), compression: compression.to_owned(), encryption: encryption.to_owned() });
                }
            }
        }
        for chunker_params in [
            format!("{}:{}:{}", 1 << 19, 1 << 21, 1 << 23), // borg default
            format!("{}:{}:{}", 1 << 10, 1 << 16, 1 << 23), // borg alternative
            format!("{}:{}:{}", 1 << 14, 1 << 16, 1 << 18), // casync default
        ] {
            methods.push(Method::Casync { chunker_params: chunker_params.to_owned(), });
        }
        methods.push(Method::Zpaq);
        for block_bytes in [4194304] {
            for zstd in [3] {
                for hash in ["murmur3", "sha256", "blake3", "blake2b"] {
                    for check in [false, true] {
                        methods.push(Method::MyChunker { block_bytes, zstd, hash: hash.to_owned(), check });
                    }
                }
            }
        }
    }
    for method in methods {
        run(&format!("mkdir {REPO}"));
        run(&format!("mkdir {CMP_ME}"));

        // Initializing
        match &method {
            Method::ZStd => {},
            Method::Borg { encryption, .. } => run(&format!("BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}")),
            Method::Casync { .. } => run(&format!("mkdir {REPO}/index {REPO}/storage.castr")),
            Method::Zpaq => {},
            Method::MyChunker { .. } => run(&format!("mkdir {REPO}/chunks {REPO}/index")),
        }

        let mut compression = vec![];

        for i in range.clone() {
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            println!("**** {method:?} {i} compression...");
            let now = std::time::Instant::now();
            match &method {
                Method::ZStd => run(&format!("zstd -3 -T1 < /home/user/dedup-bench/sto/{i:02} > {REPO}/{i:02}.zst")),
                Method::Borg { chunker_params, compression, encryption: _ } => run(&format!("cd /home/user/dedup-bench/sto; BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}")),
                Method::Casync { chunker_params } => run(&format!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02} > /dev/null")),
                Method::Zpaq => run(&format!("cd /home/user/dedup-bench/sto; zpaq add {REPO}/repo.zpaq {i:02} -threads 1")),
                Method::MyChunker { block_bytes, zstd, hash, check: _ } => run(&format!("/home/user/dedup-bench/my-chunker/target/release/my-chunker add --chunks={REPO}/chunks --block-bytes={block_bytes} --zstd={zstd} --strong-hash={hash} < /home/user/dedup-bench/sto/{i:02} > {REPO}/index/{i:02}")),
            }
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            compression.push(now.elapsed());
            println!("**** {method:?} {i} compression: {:?}", now.elapsed());
        }

        let output = std::process::Command::new("bash").arg("-ec").arg(format!("du --bytes --apparent-size -s {REPO}")).output().unwrap();
        [()] = [output.status.exit_ok().unwrap()];
        assert_eq!(output.stderr.len(), 0);
        let size: usize;
        {
            let x = String::from_utf8(output.stdout).unwrap();
            let caps = regex::Regex::new(r##"^(?<s>[0-9]+)\t/[-/a-z]+\n$"##).unwrap().captures_iter(&x).collect::<Vec<_>>();
            assert!(caps.len() == 1);
            size = caps[0]["s"].parse().unwrap();
        }
        {
            let len = size
                .to_string()
                .chars()
                .collect::<Vec<_>>()
                .rchunks(3)
                .map(|x|x.into_iter().collect::<String>())
                .rev()
                .collect::<Vec<_>>()
                .join("_");
            println!("size: {len}");
        }

        let mut decompression = vec![];
        for i in range.clone() {
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            println!("**** {method:?} {i} decompression...");
            let now = std::time::Instant::now();
            match method {
                Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")),
                Method::Borg { .. } => run(&format!("cd {CMP_ME}; BORG_PASSPHRASE=password borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")),
                Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")),
                Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")),
                Method::MyChunker { block_bytes: _, zstd: _, ref hash, check } => run(&format!("/home/user/dedup-bench/my-chunker/target/release/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} --check={check:?} < {REPO}/index/{i:02} > {CMP_ME}/data")),
            }
            run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
            decompression.push(now.elapsed());
            println!("**** {method:?} {i} decompression: {:?}", now.elapsed());
            run(&format!("cmp /home/user/dedup-bench/sto/{i:02} {CMP_ME}/data"));
            run(&format!("rm {CMP_ME}/data"));
        }

        run(&format!("rm -r {REPO}"));
        run(&format!("rm -r {CMP_ME}"));

        let total_compression   = compression.iter().sum();
        let total_decompression = decompression.iter().sum();
        let run = Run { method, size, compression, decompression, total_compression, total_decompression };
        println!("{:?}", run);
        #[derive(Debug, serde::Serialize)]
        struct JsonRun {
            method: String,
            size: usize,
            compression: Vec<u128>,
            decompression: Vec<u128>,
            total_compression: u128,
            total_decompression: u128,
        }
        let Run { method, size, compression, decompression, total_compression, total_decompression } = run;
        println!("{}", serde_json::to_string(&JsonRun {
            method: format!("{:?}", method),
            size,
            compression: compression.into_iter().map(|x|x.as_micros()).collect(),
            decompression: decompression.into_iter().map(|x|x.as_micros()).collect(),
            total_compression: total_compression.as_micros(),
            total_decompression: total_decompression.as_micros(),
        }).unwrap());
    }
}

Programs under test:

  • No deduplication. "Control group". zstd 1.5.4. Command: zstd -3 -T1
  • borg 2.0.0b5. Commands: BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}, BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression zstd,3 {i:02} {i:02}
  • casync 2+20201210-1+b1 (version as reported by debian). Commands: casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02}
  • zpaq 7.15. Commands: zpaq add {REPO}/repo.zpaq {i:02} -threads 1
  • My dedupper written in Rust. /home/user/dedup-bench/my-chunker/target/release/my-chunker add --chunks={REPO}/chunks --block-bytes=4194304 --zstd=3 --strong-hash={hash}

Results in json (this time 34 lines only here, so there is nothing scary here):

{"method":"ZStd","size":11657765795,"compression":[9283690,9678287,11367581,26593358,28771616,28750747,31540400,31680346,32376822,32378808,33297324,33229071],"decompression":[2962382,2323811,2608823,5841187,6354606,5967899,6779104,6508008,6673395,7319743,6901865,7207101],"total_compression":308948055,"total_decompression":67447929}
{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2851383597,"compression":[18318222,11317438,21237556,45904208,44393790,40996476,48499568,45409089,46290323,44191118,46500109,34804936],"decompression":[8965775,8972195,9445975,35787014,36940309,36681542,37723561,37497282,38120893,37698965,37893902,38152120],"total_compression":447862839,"total_decompression":363879538}
{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2818873535,"compression":[20102364,12104014,21250364,46696052,45152668,41188824,48250885,44991233,46159774,44186195,36323582,27978731],"decompression":[9221565,9235633,9506502,36005242,36901609,36936570,37744812,37906248,38391585,38012484,38246767,38289875],"total_compression":434384693,"total_decompression":366398899}
{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2849714452,"compression":[20135335,11978013,21000557,46756333,45916109,41730267,48565669,45327100,46166187,43952637,37075990,28128240],"decompression":[8950584,9058208,9232180,35887827,36183489,36365780,38041709,38177238,37987861,37854767,35663794,36203717],"total_compression":436732442,"total_decompression":359607160}
{"method":"Borg { chunker_params: \"buzhash,10,23,16,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2149064038,"compression":[21433546,12854993,19036014,45568056,46895977,42824579,48388277,46630976,48315401,46168988,37319377,29560584],"decompression":[9630351,9612465,10176755,36682904,37336673,37462619,38755277,38619795,38835548,39275470,39528064,39358626],"total_compression":444996774,"total_decompression":375274551}
{"method":"Borg { chunker_params: \"buzhash,10,23,16,4095\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2146105711,"compression":[21699606,12760410,19070372,44908917,46769279,43102345,48906422,47002961,48680681,46378196,36566607,29968325],"decompression":[9835854,9947821,10439907,36840113,38043775,38003903,39142987,39261288,39676490,39826286,39719886,39615441],"total_compression":445814126,"total_decompression":380353757}
{"method":"Borg { chunker_params: \"buzhash,10,23,16,4095\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2157499367,"compression":[20231258,11349910,17575160,43394101,45211290,41337098,46657102,44495116,45848495,43480255,34457257,27440901],"decompression":[8551623,8478314,8925691,33317706,35799953,35982652,36745073,34477784,34747132,37280720,37107451,37128391],"total_compression":421477949,"total_decompression":348542494}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2162838917,"compression":[19109363,10336038,16429374,35520360,36950678,33812374,38626489,37122860,38390452,36431809,29693996,25317951],"decompression":[9474411,9369419,9906256,36352626,37122162,36789206,38670100,38596851,38257199,38321414,38882990,38705175],"total_compression":357741749,"total_decompression":370447816}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2164563083,"compression":[19867389,10839858,17143115,37769750,38042463,33716280,39030970,37433763,38800999,36812705,30371597,25625937],"decompression":[9869781,9855416,10347908,36474136,37551748,37721483,39537396,39539771,38971688,39011972,39560917,39063998],"total_compression":365454832,"total_decompression":377506221}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2164097589,"compression":[18323471,9420528,15444298,36607423,36075649,32132941,36828551,34915397,36142264,34018785,27635133,22985850],"decompression":[8284186,8156134,8694047,29891890,30563610,30278642,32062420,31944804,31583080,32087650,32181069,32029320],"total_compression":340530295,"total_decompression":307756859}
{"method":"Borg { chunker_params: \"buzhash,19,23,21,48\", compression: \"zstd,3\", encryption: \"none\" }","size":2835630939,"compression":[19633874,11845746,20817763,45678328,44171313,41569956,48457003,45433164,46500342,44508459,35825546,28015594],"decompression":[9040288,9055733,9340413,35888074,37150820,36750609,37612061,37497925,38169093,37852680,38028829,38221644],"total_compression":432457094,"total_decompression":364608174}
{"method":"Borg { chunker_params: \"buzhash,19,23,21,48\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2837102441,"compression":[20211541,12130002,21500578,47107778,45693547,41644459,48569615,45460227,46557889,44690148,36637580,28108682],"decompression":[9161378,9090975,9472423,36316948,36667485,36725586,37645070,37964265,38433782,38147539,38428219,38334092],"total_compression":438312053,"total_decompression":366387767}
{"method":"Borg { chunker_params: \"buzhash,19,23,21,48\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2841300228,"compression":[19839563,11948675,20904316,46158648,44926572,40998189,48310763,45180332,46265710,44223851,36329235,27875015],"decompression":[8834926,8981821,9498129,32888619,36480689,36418669,37700783,34601169,35593303,37906159,37761933,38201286],"total_compression":432960874,"total_decompression":354867489}
{"method":"Borg { chunker_params: \"buzhash,10,23,16,48\", compression: \"zstd,3\", encryption: \"none\" }","size":2135031845,"compression":[21824758,12569046,18681384,45450353,47126995,43234937,48642344,46956004,48549333,45958714,37056332,29454305],"decompression":[9653232,9603802,10341111,36923956,37754482,37579775,38597648,39008547,39077172,38761438,39343815,39212642],"total_compression":445504510,"total_decompression":375857624}
{"method":"Borg { chunker_params: \"buzhash,10,23,16,48\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2145692735,"compression":[21699815,12685825,19091380,45176414,46856537,43071094,48642945,46888967,48576570,46042103,36645446,29798289],"decompression":[9829861,9798758,10423285,36929763,37717415,37929072,38826619,39035325,39339638,39266448,39800269,39882055],"total_compression":445175390,"total_decompression":378778514}
{"method":"Borg { chunker_params: \"buzhash,10,23,16,48\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2144992581,"compression":[20044677,11372242,17551540,43968373,45297914,41559775,46745420,44649502,46056635,43873550,33579295,27196563],"decompression":[8439352,8370019,8874321,33292381,35650680,35798891,34367147,36168242,36831071,36745149,34808103,36881564],"total_compression":421895491,"total_decompression":346226925}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\", encryption: \"none\" }","size":2149367658,"compression":[19202188,10282185,16158967,35386376,36863967,33524678,38459514,36883128,38468975,36310201,30181380,25607996],"decompression":[9560774,9720572,10081141,36231706,37176500,36832882,38732369,38359583,38258917,38308732,38774052,38777435],"total_compression":357329559,"total_decompression":370814669}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2165212174,"compression":[19994496,11001909,16816310,37486176,38088082,33874466,38906818,37310671,38929168,36742808,30276421,25853547],"decompression":[9848387,9860636,10360622,36518689,37782960,37368730,39356883,39091052,39301352,38903208,39747421,39198518],"total_compression":365280878,"total_decompression":377338462}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2170608909,"compression":[18257714,9587410,15555097,35455938,36490702,32159211,36668103,34849661,36387537,34196607,27808237,22916115],"decompression":[8204530,8187822,8610032,28935022,30186082,29866673,32000599,31765863,32466908,31627415,32487425,32223529],"total_compression":340332337,"total_decompression":306561906}
{"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,3\", encryption: \"none\" }","size":2837954615,"compression":[16641977,8981518,17861236,31209683,29905411,27442906,34127943,31146410,32103404,30310530,21990501,14716631],"decompression":[9042302,9005331,9466467,35760136,36634388,36565886,37465949,37453933,37762713,37556223,37919234,37787852],"total_compression":296438154,"total_decompression":362420421}
{"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2837954708,"compression":[17080083,9349320,18338772,32196291,30998950,27604881,34254323,31223193,32094800,30438246,22705156,14743870],"decompression":[9182614,9202718,9594391,35669209,36698953,36712690,37423578,37843085,37820586,38092966,37820292,38017167],"total_compression":301027891,"total_decompression":364078253}
{"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2837955293,"compression":[16681822,9286359,18376840,31673226,30222772,27483311,33793628,30569652,31373937,29621764,21741968,14200312],"decompression":[9017839,9029086,9237240,35717027,34026683,32061108,37518438,37291120,37486746,37251674,35155094,33467949],"total_compression":295025598,"total_decompression":347260010}
{"method":"Casync { chunker_params: \"524288:2097152:8388608\" }","size":2879294306,"compression":[37502132,29828953,38388829,162425453,158344403,155316081,159016379,155002634,155527517,153506076,154936234,153428320],"decompression":[6906559,6962960,7419022,25905660,26947735,26924282,27862707,27946487,28365741,28437593,28403558,28441681],"total_compression":1513223016,"total_decompression":270523990}
{"method":"Casync { chunker_params: \"1024:65536:8388608\" }","size":2095335450,"compression":[40231441,32560349,37437359,166501923,166456700,163119571,165479303,164293470,164053685,163130507,163433177,163316443],"decompression":[6807262,6824586,7248889,26100854,26633976,26576270,27336775,27486418,27915298,27683154,27854510,28035610],"total_compression":1590013933,"total_decompression":266503608}
{"method":"Casync { chunker_params: \"16384:65536:262144\" }","size":2123125835,"compression":[36880804,28885367,33321629,156750983,155885464,153104536,155009332,152763670,153169678,151563761,151688622,151495075],"decompression":[6573250,6564462,7090823,23860988,24478632,24588526,25476227,25581478,26007763,25979937,26140739,26167777],"total_compression":1480518927,"total_decompression":248510607}
{"method":"Zpaq","size":2020546862,"compression":[64846208,15999299,45113975,84095781,88217304,71561268,87819893,74730235,79004485,72064634,74074986,73135629],"decompression":[11504019,11442280,17221606,22817373,27206092,27206680,31463812,32912227,34842936,34780759,35448384,35439551],"total_compression":830663702,"total_decompression":322285723}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"murmur3\", check: false }","size":2837727787,"compression":[11972998,4501924,12993601,28032655,25078869,21966876,26911725,23199234,23649817,21983818,23332558,22013069],"decompression":[2742755,2685176,3043580,6158789,6734138,6970033,7542192,7608766,8009475,7905699,8143381,8095599],"total_compression":245637149,"total_decompression":75639589}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"murmur3\", check: true }","size":2837727787,"compression":[11975526,4508979,12905268,28090643,25081155,21973767,26848370,23173431,23627225,22000042,23324900,22000017],"decompression":[3306916,3255777,3611647,8924207,10175322,9619307,10386219,10362477,11191207,10910435,10920920,10985661],"total_compression":245509328,"total_decompression":103650100}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"sha256\", check: false }","size":2838121003,"compression":[16749795,9173678,17869084,51539683,48496978,45374019,50195634,46567898,47029669,45336113,46776453,45328806],"decompression":[2753205,2697572,3062993,6109653,6723439,6694980,7456240,7805207,7903225,7941173,8034388,8094563],"total_compression":470437817,"total_decompression":75276643}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"sha256\", check: true }","size":2838121003,"compression":[16691595,9177770,17885076,51471117,48542486,45351514,50248956,46587920,47013303,45374848,46801416,45384168],"decompression":[8165492,8177129,8513070,33387636,34289479,33884103,34666111,34804762,35095678,35119255,35282643,35206638],"total_compression":470530175,"total_decompression":336592002}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake3\", check: false }","size":2838121003,"compression":[12079168,4576985,13078441,28469710,25368211,22287125,27185902,23682801,23922831,22378973,23702148,22375989],"decompression":[2719902,2694905,3049538,6100986,6721047,6668222,7668869,7601103,7961794,7940910,8065481,8110993],"total_compression":249108289,"total_decompression":75303755}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake3\", check: true }","size":2838121003,"compression":[12025524,4493332,13012576,28009321,24979423,21895509,26819883,23100268,23629206,21931283,23282489,21969911],"decompression":[3458388,3461221,3816301,9637175,10457420,10680664,11447276,11428309,11520339,11409510,11951739,11662459],"total_compression":245148729,"total_decompression":110930807}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake2b\", check: false }","size":2838907435,"compression":[13230124,5637490,14433041,33935850,30851062,27675085,32690185,30380651,29399646,27728548,29126626,27772168],"decompression":[2714048,2688687,3056966,6073851,6747581,6908087,7494888,7550462,7893937,7930224,8101463,8143793],"total_compression":302860481,"total_decompression":75303993}
{"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake2b\", check: true }","size":2838907435,"compression":[13097185,5644976,14237922,33926943,30907781,27754915,32844449,29006030,30476310,27681230,30693571,27749998],"decompression":[4825484,4872713,4938557,15678113,17426670,16434735,17054877,17291963,17479366,18607035,17708124,17825607],"total_compression":304021316,"total_decompression":170143248}

The same as a nice table:

Program under test                                                                                               Compressed size (bytes)        Co. time        De. time  Co. time of 11  De. time of 11
ZStd                                                                                                                      11_657_765_795      308.948055       67.447929       33.229071        7.207101
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "none" }                                2_851_383_597      447.862839      363.879538       34.804936       38.152120
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated" }                       2_818_873_535      434.384693      366.398899       27.978731       38.289875
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_849_714_452      436.732442      359.607160       28.128240       36.203717
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "none" }                                2_149_064_038      444.996774      375.274551       29.560584       39.358626
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated" }                       2_146_105_711      445.814126      380.353757       29.968325       39.615441
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_157_499_367      421.477949      348.542494       27.440901       37.128391
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "none" }                                2_162_838_917      357.741749      370.447816       25.317951       38.705175
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated" }                       2_164_563_083      365.454832      377.506221       25.625937       39.063998
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_164_097_589      340.530295      307.756859       22.985850       32.029320
Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "none" }                                  2_835_630_939      432.457094      364.608174       28.015594       38.221644
Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated" }                         2_837_102_441      438.312053      366.387767       28.108682       38.334092
Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_841_300_228      432.960874      354.867489       27.875015       38.201286
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "none" }                                  2_135_031_845      445.504510      375.857624       29.454305       39.212642
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated" }                         2_145_692_735      445.175390      378.778514       29.798289       39.882055
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_144_992_581      421.895491      346.226925       27.196563       36.881564
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "none" }                                  2_149_367_658      357.329559      370.814669       25.607996       38.777435
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated" }                         2_165_212_174      365.280878      377.338462       25.853547       39.198518
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_170_608_909      340.332337      306.561906       22.916115       32.223529
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "none" }                                        2_837_954_615      296.438154      362.420421       14.716631       37.787852
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated" }                               2_837_954_708      301.027891      364.078253       14.743870       38.017167
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_955_293      295.025598      347.260010       14.200312       33.467949
Casync { chunker_params: "524288:2097152:8388608" }                                                                        2_879_294_306     1513.223016      270.523990      153.428320       28.441681
Casync { chunker_params: "1024:65536:8388608" }                                                                            2_095_335_450     1590.013933      266.503608      163.316443       28.035610
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1480.518927      248.510607      151.495075       26.167777
Zpaq                                                                                                                       2_020_546_862      830.663702      322.285723       73.135629       35.439551
MyChunker { block_bytes: 4194304, zstd: 3, hash: "murmur3", check: false }                                                 2_837_727_787      245.637149       75.639589       22.013069        8.095599
MyChunker { block_bytes: 4194304, zstd: 3, hash: "murmur3", check: true }                                                  2_837_727_787      245.509328      103.650100       22.000017       10.985661
MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: false }                                                  2_838_121_003      470.437817       75.276643       45.328806        8.094563
MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: true }                                                   2_838_121_003      470.530175      336.592002       45.384168       35.206638
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake3", check: false }                                                  2_838_121_003      249.108289       75.303755       22.375989        8.110993
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake3", check: true }                                                   2_838_121_003      245.148729      110.930807       21.969911       11.662459
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: false }                                                 2_838_907_435      302.860481       75.303993       27.772168        8.143793
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true }                                                  2_838_907_435      304.021316      170.143248       27.749998       17.825607

/proc/cpuinfo of host:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
stepping        : 4
microcode       : 0x2f
cpu MHz         : 3092.919
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips        : 6185.31
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
stepping        : 4
microcode       : 0x2f
cpu MHz         : 3092.730
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips        : 6185.31
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Output of borg2 benchmark cpu:

# borg2 benchmark cpu
Chunkers =======================================================
buzhash,19,23,21,4095    1GB        1.378s
fixed,1048576            1GB        0.115s
Non-cryptographic checksums / hashes ===========================
xxh64                    1GB        0.239s
crc32 (zlib)             1GB        0.436s
Cryptographic hashes / MACs ====================================
hmac-sha256              1GB        2.556s
blake2b-256              1GB        2.513s
Encryption =====================================================
aes-256-ctr-hmac-sha256  1GB        3.180s
aes-256-ctr-blake2b      1GB        3.745s
aes-256-ocb              1GB        0.581s
chacha20-poly1305        1GB        1.261s
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2                   5          0.358s
argon2                   5          0.553s
Compression ====================================================
lz4          0.1GB      0.016s
zstd,1       0.1GB      0.041s
zstd,3       0.1GB      0.052s
zstd,5       0.1GB      0.497s
zstd,10      0.1GB      2.175s
zstd,16      0.1GB      10.881s
zstd,22      0.1GB      10.380s
zlib,0       0.1GB      0.071s
zlib,6       0.1GB      2.480s
zlib,9       0.1GB      2.480s
lzma,0       0.1GB      16.766s
lzma,6       0.1GB      29.766s
lzma,9       0.1GB      24.211s
msgpack ========================================================
msgpack      100k Items 0.100s

Now conclusions.

  • Repo created with borg2 rcreate -e authenticated-blake2 is faster than borg2 rcreate -e none. But this is completely undiscoverable! Please, add to very beginning of section "Encryption mode TLDR" here: https://borgbackup.readthedocs.io/en/2.0.0b5/usage/rcreate.html phrase "Even if you don't want encryption and authentication, note that adding authentication can make borg faster". If I didn't want to create this benchmark, I would never know this! Proof:
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "none" }                                2_851_383_597      447.862839      363.879538       34.804936       38.152120
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated" }                       2_818_873_535      434.384693      366.398899       27.978731       38.289875
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_849_714_452      436.732442      359.607160       28.128240       36.203717
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "none" }                                2_149_064_038      444.996774      375.274551       29.560584       39.358626
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated" }                       2_146_105_711      445.814126      380.353757       29.968325       39.615441
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_157_499_367      421.477949      348.542494       27.440901       37.128391
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "none" }                                2_162_838_917      357.741749      370.447816       25.317951       38.705175
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated" }                       2_164_563_083      365.454832      377.506221       25.625937       39.063998
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_164_097_589      340.530295      307.756859       22.985850       32.029320
  • casync compresses a way slower than borg on same parameters. But decompression is way faster (casync doesn't check hashes on decompression? Or uses some superfast hash?). Proof:
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "none" }                                  2_135_031_845      445.504510      375.857624       29.454305       39.212642
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated" }                         2_145_692_735      445.175390      378.778514       29.798289       39.882055
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_144_992_581      421.895491      346.226925       27.196563       36.881564
Casync { chunker_params: "1024:65536:8388608" }                                                                            2_095_335_450     1590.013933      266.503608      163.316443       28.035610
  • borg can achieve nearly same compress ratio as zpaq while being faster both in compression and decompression. And compression is 2x faster
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_170_608_909      340.332337      306.561906       22.916115       32.223529
Zpaq                                                                                                                       2_020_546_862      830.663702      322.285723       73.135629       35.439551
  • borg compresses faster than my dedupper on same settings in sha-256 mode. I think this is because of problems in Rust's sha-256 implementation (decompression speed is similar)
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "none" }                                        2_837_954_615      296.438154      362.420421       14.716631       37.787852
MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: true }                                                   2_838_121_003      470.530175      336.592002       45.384168       35.206638
  • my dedupper decompresses way faster than borg in blake2b mode on same settings (and this time I check hashes when decompressing!) (compression speed is similar)
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_955_293      295.025598      347.260010       14.200312       33.467949
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true }                                                  2_838_907_435      304.021316      170.143248       27.749998       17.825607
  • In my dedupper blake3 is as fast as murmur3. It is cool
  • In my dedupper blake3 is way faster than blake2b

Now let me list actionable items (in my opinion) for borg:

  • Add note to docs for rcreate (see above)
  • In blake2b mode my program decompresses x2 faster than borg. I. e. something going really wrong here. My decompressor does absolutely trivial things. It decompresses, then checks hashes
  • Consider adding support for blake3, it is really faster than blake2b

Rust's sha256 performance problems are strange. I will try to investigate more and possibly will report bug to sha256 implementations. It seems that my host supports sha256 hardware acceleration, but Rust's implementations are unable to detect this (but python's can)

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 26, 2023

IIRC, we already have some notes in the docs that speed advantages of misc. algorithms depend on the CPU capabilities. But, as you have experienced, this is sometimes a bit tricky or even counter-intuitive. Because of that, I have added a simple borg benchmark cpu to borg2, so people can just run that to get an impression about which algorithms are fastest for their specific environment.

I know that blake3 is even faster than blake2b, but:

  • blake3-py seems to still depend on rust, not sure if all platforms borg works on have that (and even if they have it, it would mean that if borg depends on blake3-py, borg users would need to have the rust toolchain installed to install borg if there is no binary wheel for blake3-py for that platform). the C-based stuff is still considered experimental (for quite long meanwhile and there does not seem to be much interest in that from the developers).
  • because of borg transfer, borg2 needs to support all id-hashes it previously supported, so we can only add blake3, but not remove blake2b. also, the code structure in borg dealing with this does not scale well to a lot of algorithms.
  • if the CPU supports sha2 hw accel, (hmac-)sha256 is faster than blake2 or blake3. sha2 support is increasingly common in CPUs.

Speed: in general, if some other code than borg is faster, that does not imply that there is something wrong in borg. It could be just that borg does something that that code does not do. But if that is something useful, that's good. So, such a finding needs more research. It might well be though that borg extract is less optimized yet than borg create- it is also used less frequently.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 26, 2023

Had a look and also re-grouped the lines, removed some less relevant ones, maybe it is useful if someone else looks at this:

Program under test                                                                                               Compressed size (bytes)        Co. time        De. time  Co. time of 11  De. time of 11

ZStd                                                                                                                      11_657_765_795      308.948055       67.447929       33.229071        7.207101

Zpaq                                                                                                                       2_020_546_862      830.663702      322.285723       73.135629       35.439551

Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_849_714_452      436.732442      359.607160       28.128240       36.203717
Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_841_300_228      432.960874      354.867489       27.875015       38.201286
Casync { chunker_params: "524288:2097152:8388608" }                                                                        2_879_294_306     1513.223016      270.523990      153.428320       28.441681

Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_157_499_367      421.477949      348.542494       27.440901       37.128391
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_144_992_581      421.895491      346.226925       27.196563       36.881564
Casync { chunker_params: "1024:65536:8388608" }                                                                            2_095_335_450     1590.013933      266.503608      163.316443       28.035610

Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_164_097_589      340.530295      307.756859       22.985850       32.029320
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_170_608_909      340.332337      306.561906       22.916115       32.223529
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1480.518927      248.510607      151.495075       26.167777

Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_955_293      295.025598      347.260010       14.200312       33.467949
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true }                                                  2_838_907_435      304.021316      170.143248       27.749998       17.825607

@jdchristensen
Copy link
Contributor

Why would borg be faster using -e authenticated-blake2 than when using -e none? Can you re-run those tests to confirm? Run multiple times to be sure this isn't an issue with the kernel caching data, or random variation.

@ThomasWaldmann
Copy link
Member

@jdchristensen it has to compute the id-hash in any case and if blake2b is faster than sha256 (which is rather slow if computed purely in sw), that might be the reason. otoh, whether a secret key is involved (authenticated-blake2) or not (none) doesn't make a significant difference.

@jdchristensen
Copy link
Contributor

Right, that makes sense. With borg2, one can use borg2 benchmark cpu, and on the machines I tested, sha256 is faster than blake2b. The tests above must be on a system with a slow sha256.

@safinaskar
Copy link
Author

safinaskar commented Jun 26, 2023

I did some more research using crate https://docs.rs/cpufeatures and now I see that my host has no hardware sha support. Rust's various sha256 implementations are slow on my host, I reported this here: RustCrypto/hashes#327 (comment) . So result "-e authenticated-blake2 is faster than -e none" is predictable

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 26, 2023

One idea to make borg extract faster (in some cases): #1678

@safinaskar
Copy link
Author

I'm attempting to understand why borg create is so fast. And why it often significantly faster than my dedupper. I'm reading borg's codebase. I extracted from borg codebase small piece of code, which reads file, splits it to equally-sized chunks and computes crc32 hash of each of them (yes, I replaced hashes supported by borg with crc32). Here is what I learned: LRU cache of hashes of zero blocks matters. I mean this line:

chunk_id = zero_chunk_ids[(id_hash, size)]
. If I remove this LRU cache, then time of reading and hashing typical 10 GiB file from my data grows from 5 secs to 8 secs. (It seems my files contain lot of zero blocks.)

But even if I remove LRU cache, borg is still sometimes faster than my code, so I'm continuing to investigate why it is faster

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 30, 2023

@safinaskar sure, some sort of files, like disk image files can contain huge amounts of all-zero chunks (in the optimal case as sparse areas, but also as on-disk zeros) and the chunkers will often cut chunks of a few specific sizes, so not re-computing the same hash again and again saves a lot of time. IIRC, borg create has this optimisation but borg extract does not have any special optimisation for repeating (all-zero) chunks yet.

another optimisation is that borg's fixed chunker can deal with sparse files and does not read the sparse sections from disk, but seeks over them. the buzhash chunker does not have real sparse support because a lot of it is C code and it is already quite complex even without that. for disk images, one should rather use the fixed chunker anyway.

@safinaskar
Copy link
Author

I found another bunch of facts (when benchmarking)

  • Reading of sparse files (as you just told) matters. (I. e. setting sparse to True in
    def __init__(self, block_size, header_size=0, sparse=False):
    )
  • POSIX_FADV_DONTNEED matters.
    os.posix_fadvise(fh, offset, len(data), os.POSIX_FADV_DONTNEED)

@safinaskar
Copy link
Author

I did more tests and found that POSIX_FADV_DONTNEED actually causes very big speedup in make mode in my dedupper (this mode corresponds to borg2 create). In one particular test I got x2.5 speedup (but I used zstd level -22 and blake3). But to get speedup from POSIX_FADV_DONTNEED many conditions must hold simultaneously:

  • Reading should be from disk, not from tmpfs
  • Disk should be slow enough (I got x2.5 speedup on slow [network] disk and x1.075 speedup on fast disk)
  • Input data should be already in OS cache (i. e. one need to execute cat input-data > /dev/null before testing)
  • All data should not fit to memory

Here is how I did my tests:

  • I dropped caches using sync; echo 3 > /proc/sys/vm/drop_caches (to make sure previous runs don't affect us)
  • I calculate free memory and make sure that free memory is less than total data size (i. e. total size of files 00, 01, etc), but more than size of any single input file. Well, I don't know how exactly one should calculate free memory size. Whether I should use formula MemAvailable - Shmem or formula MemFree - Shmem (these are names of fields in /proc/meminfo). So I simply make sure that both formulas give number, which is less than total data size (but bigger than any particular file). Note that I don't have swap. If available memory is too big, then I make it lesser by creating big file in tmpfs
  • I precache zeroth file using cat 00 > /dev/null
  • I run my dedupper and measure its time, i. e. something similar to time -p my-dedupper make 00 (this is analog of borg2 create)
  • Then I precache first file and run dedupper on it (i. e. I add file 01 to same repo). Then same for second file, etc
  • In the end I sum all measured times

And this algorithm causes POSIX_FADV_DONTNEED to give x2.5 speedup.

You may say that this is very artificial set of conditions. Well, no. Speedup appeared naturally. Here is what happened: at first I did very unscientific benchmarks and found that borg2 is faster than my dedupper. Then I tried to understand why it is so and found that one of the reasons is POSIX_FADV_DONTNEED, I reported this here: #7674 (comment) . Then I discovered that POSIX_FADV_DONTNEED not always gives speedup, so I tried to find exact set of conditions that cause it. And I found them.

Note that if some of conditions don't hold, then we can get great slow down instead of speedup. For example, if available memory is bigger than all input data, then we get x1.3 slow down instead of x2.5 speedup

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jul 22, 2023

- Input data should be already in OS cache (i. e. one need to execute cat input-data > /dev/null before testing)
- I dropped caches using sync; echo 3 > /proc/sys/vm/drop_caches (to make sure previous runs don't affect us)

Guess the 2nd will mostly neutralise the effect of the first.

Although there might be a small difference:

  • first might impose some memory pressure so other stuff not needed gets discarded or paged out
  • second frees the cache memory allocated by first
  • so in the end, there might be more free memory than before

About why borg is using POSIX_FADV_DONTNEED:

  • when doing a backup, borg reads all new/changed data exactly once, but the likelihood that this data is read again soon afterwards is rather small (borg does not do it, neither does something else for most cases). thus it is pointless to keep this data in the cache. when a lot of data is changed (or new), that would impose a lot of useless memory pressure when borg reads many GBs or even TBs of data. in a bad case, other more useful data in the cache might get discarded, in the worst case, the system might even decide to page in/out a lot.
  • there is only one case when DONTNEED has a small disadvantage: if borg does that to a heavily used file (e.g. an active database). but that is only a one-time effect: after a file range has been discarded by DONTNEED, the next access (e.g. by the db server) will re-read it from disk and then it will be in the cache again.
  • it is similar for the repo files: borg writes them once while doing a backup, but then they are not accessed any more - another case for DONTNEED.

About the slowdown:

  • guess the memory management work caused by POSIX_FADV_DONTNEED costs some time, so if there is a huge amount of memory, i guess it really could be faster at first, for the measured process, to not invoke DONTNEED
  • but, assuming borg is not the only thing using the cache, that only shifts the memory management work (and time needed for that) to another point in time. if the memory is needed otherwise, the same mm work will be done, sooner or later.

@safinaskar
Copy link
Author

Thanks for answer!

Guess the 2nd will mostly neutralise the effect of the first

There is some misunderstanding here. I first drop caches and then pre-cache input file. My tests look similar to this:

#!/bin/bash
sync
echo 3 > /proc/sys/vm/drop_caches
for I in {00..11}; do
  cat "$I" > /dev/null
  time -p path-to/my-dedupper make "$I" ......
done

@ThomasWaldmann
Copy link
Member

Not sure why you do the cat - your dedupper will read the whole file anyway, right?

Also, it is not a very realistic scenario.

@safinaskar
Copy link
Author

Again: I found that POSIX_FADV_DONTNEED is fast, and then I tried to write down exact scenario when it is fast. And it turned out that I need that cat to reproduce this speedup. Yes, I agree, this is unrealistic scenario. (Also, I just found that in my use case input file is always on tmpfs, so this test doesn't agree with my use case.)

@safinaskar
Copy link
Author

I added more optimizations to my dedupper (in particular, it is parallel now). And now I did truly scientific benchmark.

Now my dedupper is parallel thanks to great Rust library "rayon". (My dedupper is called azwyon now, this name was generated using /dev/urandom.) rayon is absolutely unique library. This is work-stealing parallelism library, which is fast and safe in the same time thanks to unique Rust features. I think this would be impossible to write something like rayon in any other language. Here is explained why: https://smallcultfollowing.com/babysteps/blog/2015/12/18/rayon-data-parallelism-in-rust/ .

Still, rayon lacked particular feature I need, so I implemented it: rayon-rs/rayon#1071 .

Also, this new version of dedupper manages not only chunks, but also index (as opposed to previous version).

Now let's proceed to experiment description.

Now I do truly scientific benchmark using these advises: https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux , https://llvm.org/docs/Benchmarking.html .

I use AWS bare metal server for benchmarks. Yes, my previous AWS experience was bad. I don't know why. Possibly this is because I did read data from networked disk. This time everything is on AWS bare metal server, in tmpfs.

AWS has many threads, but I restricted all tools to 4 CPUs using cset.

I did run every tool in its best settings. For example, all single-threaded tools run in single-threaded mode, but all parallel-capable tools run in parallel mode. Yes, this is probably "unfair", but goal for my benchmark is pragmatic: choosing tool for my use case. So I simply run every tool in its "best" mode.

Tools under test are:

  • zstd 1.5.4
  • borg 2.0.0b5 (in fixed chunking mode, blake2b)
  • casync 2+20201210-1+b1
  • zpaq 7.15
  • azwyon (fixed chunking)
  • desync a0052781292e3d140f6f117a2d608a22471e86e8 (yes, I added desync!)

All parallel capable tools run in parallel mode. These are: zstd, zpaq, azwyon, desync. Single-threaded: borg, casync.

All tools (except for zpaq) use zstd level 3 compression.

Comparison between borg and azwyon is "completely unfair", because:

  • azwyon is parallel and borg not
  • azwyon used blake3 and borg - blake2b
  • azwyon doesn't check hash when decompressing and borg checks

Yes, I know this. I simply want to find what suits my case best. If you disagree, you may ask to re-run benchmark with different settings.

Okay, so here is code for azwyon: https://paste.gg/p/anonymous/7842d4fc787a4b42bcc4ae02a6e0ecfa

And here is code for benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1

And here are results:

Name                                                                                                                                Size     Total ctime     Total dtime      Last ctime      Last dtime
-
ZStd                                                                                                                      11_657_765_795       65.349372       70.824446        7.127620        7.590774
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_956_595      135.743906      322.220078       11.352489       33.698906
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1192.330055      226.266151      121.339989       23.731491
Zpaq                                                                                                                       2_020_546_862      855.321357      288.076099       96.098747       31.608796
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false }                                          2_838_121_003       34.133136       48.650205        2.729053        4.986763
Desync { chunker_params: "16:64:256" }                                                                                     2_138_317_619      271.476648       88.211184       27.106573        9.041686

Exact command lines:

        // Initializing
        match &method {
            Method::ZStd => {},
            Method::Borg { encryption, .. } => cmd!("BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}"),
            Method::Casync { .. } => cmd!("mkdir {REPO}/index {REPO}/storage.castr"),
            Method::Zpaq => {},
            Method::Azwyon { .. } => cmd!("mkdir {REPO}/chunks {REPO}/index"),
            Method::Desync { .. } => cmd!("mkdir {REPO}/index {REPO}/storage.castr"),
        }
// //
            match &method {
                Method::ZStd => cmd!("zstd -3 -T0 < /home/user/dedup-bench/input-fs/{i:02} > {REPO}/{i:02}.zst"),
                Method::Borg { chunker_params, compression, encryption: _ } => cmd!("cd /home/user/dedup-bench/input-fs; BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}"),
                Method::Casync { chunker_params } => cmd!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/{i:02} > /dev/null"),
                Method::Zpaq => cmd!("cd /home/user/dedup-bench/input-fs; zpaq add {REPO}/repo.zpaq {i:02} > /dev/null"),
                Method::Azwyon { chunk_size, zstd, digest, check_extracted: _ } => cmd!("/home/user/dedup-bench/azwyon-dedup/target/release/azwyon-dedup make --chunk-size={chunk_size} --store={REPO} --zstd={zstd} --id={i:02} --digest={digest} < /home/user/dedup-bench/input-fs/{i:02}"),
                Method::Desync { chunker_params } => cmd!("/desync make -m {chunker_params} -s {REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/{i:02}"),
            }
// //
            match method {
                Method::ZStd => cmd!("zstd -d -T0 < {REPO}/{i:02}.zst > /home/user/dedup-bench/input-fs/data"),
                Method::Borg { .. } => cmd!("cd /home/user/dedup-bench/input-fs; BORG_PASSPHRASE=password borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data"),
                Method::Casync { .. } => cmd!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/data"),
                Method::Zpaq => cmd!("cd /home/user/dedup-bench/input-fs; zpaq extract {REPO}/repo.zpaq {i:02} > /dev/null; mv -i {i:02} data"),
                Method::Azwyon { chunk_size: _, zstd: _, ref digest, check_extracted } => cmd!("/home/user/dedup-bench/azwyon-dedup/target/release/azwyon-dedup extract --store={REPO} --id={i:02} --to=/home/user/dedup-bench/input-fs/data --digest={digest} --check-extracted={check_extracted:?}"),
                Method::Desync { .. } => cmd!("/desync extract -s {REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/data"),
            }

Output of borg2 benchmark cpu:

$ borg2 benchmark cpu
Chunkers =======================================================
buzhash,19,23,21,4095    1GB        1.178s
fixed,1048576            1GB        0.080s
Non-cryptographic checksums / hashes ===========================
xxh64                    1GB        0.074s
crc32 (zlib)             1GB        0.282s
Cryptographic hashes / MACs ====================================
hmac-sha256              1GB        2.261s
blake2b-256              1GB        2.109s
Encryption =====================================================
aes-256-ctr-hmac-sha256  1GB        2.844s
aes-256-ctr-blake2b      1GB        3.501s
aes-256-ocb              1GB        0.439s
chacha20-poly1305        1GB        0.568s
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2                   5          0.692s
argon2                   5          0.269s
Compression ====================================================
lz4          0.1GB      0.021s
zstd,1       0.1GB      0.034s
zstd,3       0.1GB      0.040s
zstd,5       0.1GB      0.348s
zstd,10      0.1GB      0.832s
zstd,16      0.1GB      8.699s
zstd,22      0.1GB      12.428s
zlib,0       0.1GB      0.058s
zlib,6       0.1GB      2.603s
zlib,9       0.1GB      2.603s
lzma,0       0.1GB      23.838s
lzma,6       0.1GB      32.118s
lzma,9       0.1GB      29.509s
msgpack ========================================================
msgpack      100k Items 0.248s

Result: azwyon is fastest. :)

On compression. azwyon compresses x2 faster than second place, i. e. zstd. Yes, azwyon compresses data using zstd level 3 faster than zstd level 3 itself. :) This is because azwyon doesn't compress same blocks twice. 3rd place is borg's. And azwyon is x4 faster than borg.

On decompression. Well, azwyon is fastest again.

Why azwyon is so fast? Well, this is not because of POSIX_FADV_DONTNEED. POSIX_FADV_DONTNEED matters for disks only, and this benchmark was on tmpfs. So, what are reasons? First of all, I do very few things. Second, I do compression/decompression in parallel. Third, I don't have any kind of database or locking. Repo maintains its consistency thanks to open with O_TMPFILE. Yes, this means that you can add two files to repo in parallel and consistency will be maintained! (But I didn't test this yet.)

@safinaskar
Copy link
Author

I added rdedup ( https://github.com/dpc/rdedup ). It is written in Rust, too. It is parallel. It uses content-defined chunking. I tested blake2b hash.

Name                                                                                                                                Size     Total ctime     Total dtime      Last ctime      Last dtime
-
ZStd                                                                                                                      11_657_765_795       65.349372       70.824446        7.127620        7.590774
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_956_595      135.743906      322.220078       11.352489       33.698906
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1192.330055      226.266151      121.339989       23.731491
Zpaq                                                                                                                       2_020_546_862      855.321357      288.076099       96.098747       31.608796
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false }                                          2_838_121_003       34.133136       48.650205        2.729053        4.986763
Desync { chunker_params: "16:64:256" }                                                                                     2_138_317_619      271.476648       88.211184       27.106573        9.041686
RDedup { chunker_params: "fastcdc", chunk_size: "4096K" }                                                                  3_009_007_386      111.050768      271.345209        9.683813       27.996724

Command lines:

Method::RDedup { chunker_params, chunk_size, } => cmd!("rdedup --dir {REPO} init --chunking {chunker_params} --chunk-size {chunk_size} --compression-level 3 --encryption none --nesting 0"),
Method::RDedup { .. } => cmd!("rdedup --dir {REPO} store {i:02} < /home/user/dedup-bench/input-fs/{i:02}"),
Method::RDedup { .. } => cmd!("rdedup --dir {REPO} load {i:02} > /home/user/dedup-bench/input-fs/data"),

My dedupper is still the best

@ThomasWaldmann
Copy link
Member

OK. Maybe we should get back to the general topic of borgbackup. :-)

So, after all the research you did, did you find anything in borg missing or wrong (besides parallelism, for which we already have a ticket)?

@safinaskar
Copy link
Author

I have nothing to add to actionable items listed here: #7674 (comment)

You said (on adding new hash algorithms): "the code structure in borg dealing with this does not scale well to a lot of algorithms". Unfortunately (if you really care about hash collisions) you should regularly migrate to new hash algorithms as described here: https://valerieaurora.org/hash.html . So, such infrastructure should be eventually added

@safinaskar
Copy link
Author

Also, as well as I understand, borg requires locking, so you cannot run two borg2 create commands in the same time. But this means that you cannot backup two hosts in parallel while exploiting duplication between these hosts.

But in azwyon I was able to implement consistent parallel make (i. e. borg2 create) using O_TMPFILE tricks as described here: #7674 (comment) . With very small amount of code.

But azwyon doesn't support deleting of blobs. But duplicacy (don't confuse with duplicity and duplicati) supports lock-free parallel adding and deleting of blobs (as well as I understand). Here is how they achieved this: https://github.com/gilbertchen/duplicacy/wiki/Lock-Free-Deduplication . This can be very useful addition to borg.

Also, rdedup supports asymmetric cryptography: https://github.com/dpc/rdedup/wiki/My-original-use-case-%28old-README.md%29

@safinaskar
Copy link
Author

Last benchmark compares fixed sized chunking to CDC, this is not good. Also, CDC-based tools with different average chunk sizes compared to each other, this is not good, too. So here are new rules:

  • Benchmark fixed sized deduppers with chunk size 4194304 bytes with compression zstd level 3. In all other regards tools are run in their fastest possible mode. If a tool can be parallel, it runs parallel, etc
  • Benchmark CDC-based deduppers with average chunk size 64 KiB and 4096 KiB with compression zstd level 3. Again, all other settings are fastest possible

Here are results:

Name                                                                                                                                Size     Total ctime     Total dtime      Last ctime      Last dtime
-
fixed chunk 4194304 bytes
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_956_595      135.743906      322.220078       11.352489       33.698906
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false }                                          2_838_121_003       34.133136       48.650205        2.729053        4.986763

cdc 64 KiB
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_166_931_998      226.121918      333.174844       21.173136       34.712557
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1192.330055      226.266151      121.339989       23.731491
Desync { chunker_params: "16:64:256" }                                                                                     2_138_317_619      271.476648       88.211184       27.106573        9.041686
RDedup { chunker_params: "fastcdc", chunk_size: "64K" }                                                                    2_244_051_351     1398.161566      264.857065      102.453551       27.988610

cdc 4096 KiB
Borg { chunker_params: "buzhash,20,23,22,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                3_016_091_076      244.211777      352.128499       22.203505       37.025217
Casync { chunker_params: "1048576:4194304:16777216" }                                                                      3_186_350_053     1234.660061      242.058550      124.582949       25.762375
Desync { chunker_params: "1024:4096:16384" }                                                                               3_191_746_029      300.476295      105.383257       29.178904       11.157104
RDedup { chunker_params: "fastcdc", chunk_size: "4096K" }                                                                  3_009_007_386      111.050768      271.345209        9.683813       27.996724

I leaved zstd, because it has no concept of chunking. I leaved zpaq, because, well, I don't want to play with it. Zpaq has many settings, it is even possible it can beat other tools with right settings, but I don't want to spend my time on it

@dbaarda
Copy link

dbaarda commented Jul 30, 2023

Note for most CDC implementations the "average" avg-size setting is not really the average blocksize, it's the average extra length after the min-size. So the real average blocksize (ignoring truncation effects from max-size) is min-size+avg-size. In general you want to set max-size large enough to avoid truncation effects.

For a given real target average blocksize the best de-duplication and speed settings are min-size=avg-size=average/2, max-size>=2.5*average. This really does mean the optimal avg-size is the same as the min-size. Some CDC implementations have a (sub-optimal) constraint that min-size<avg-size, so you have to set min-size one smaller.

@safinaskar
Copy link
Author

Okay, so here is list of Github issues I spammed wrote in last few days on this topic (i. e. fast fixed-sized and CDC-based deduplication). I hope they provide great insight to everyone interested in fast deduplicated storage.
#7674
systemd/casync#259
folbricht/desync#243
ipfs/specs#227
dpc/rdedup#222
opencontainers/umoci#256

@safinaskar
Copy link
Author

safinaskar commented Aug 10, 2023

I again summarized my list of problems with existing deduppers: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl

@ThomasWaldmann
Copy link
Member

@safinaskar That post has some issues and is not true as it is now.

https://borgbackup.readthedocs.io/en/stable/usage/init.html?highlight=blake2 there it talks quite a bit about speed of hash algos, so it is documented and discoverable.

Also "uses slow sha256" is not true on modern machines - with sha2 hw acceleration, sha256 is faster than blake2, maybe even faster than blake3 would be?

The hash verification at extraction time is done for security reasons, you omitted that in your post.

IIRC I also told you already why we did not add blake3 yet, one main reason being the dependency on rust and that the C stuff is still beta/experimental there and not on pypi IIRC.

So, please fix that post.

@safinaskar
Copy link
Author

You gave link to borg1. Yes, I see phrase "which makes authenticated-blake2 faster than none" there. But docs for borg2 do not have such phrase ( https://borgbackup.readthedocs.io/en/2.0.0b6/usage/rcreate.html ). (I updated my post.)

Also "uses slow sha256" is not true on modern machines

I updated, too.

The hash verification at extraction time is done for security reasons

I think this is obvious.

also told you already why we did not add blake3 yet

Yes, but blake3 support is still absent. I don't say that borg is wrong in not supporting blake3. I just told in that post that borg doesn't support blake3

Also, I can give you invite on lobste.rs if you want

@ThomasWaldmann
Copy link
Member

@safinaskar borg2 has borg benchmark cpu, where every user can easily find out what's fastest on a specific machine (on bare hw or in a VM).

That made a lot of related docs superfluous and is easier for the users (e.g. sha2-support can be tricky: you might have a modern cpu with sha2, but run borg in a VM that does not pass through sha2).

@dbaarda
Copy link

dbaarda commented Aug 15, 2023

FTR, I did some analysis/tests of CDC and rollsums, and in particular FastCDC here;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst

Important to note is that FastCDC has some problems, and is actually worse than a simple chunker. Specifically they claim 2 improvements; a better/faster match criteria, and "normalized chunking" giving a better size distribution that allows for more "cut point skipping" speedups without hurting deduplication. Neither of these are that great.

First, they realised that the Gear rollsum only has entropy from the most recent bytes in the least-significant-bits, but they didn't realise that it shifts and accumulates the entropy towards the most significant bits. So their complicated zero-padded mask used for their !(hash & mask) check is actually worse than just using the N most significant bits, which can also be done using an equally fast hash < THRESHOLD check where THRESHOLD = 2^32 / tgt_len, which also works for non-power-of-2 target sizes.

Second, their complicated "normalized chunking" actually gives worse deduplication than a simple chunker with the same amount of cut-point-skipping speedups. Any sort of normalization makes chunk cut-points more dependent on where the previous cut-point was, which messes with re-synchronising the cut-points after a delta, which adversely affects deduplication. All their measured "improvements" were an artefact of the settings they chose producing a smaller average chunk size. A simple chunker with settings to give the same min and average chunk size is simpler/faster with better deduplication.

It also turns out that a simple chunker performs really well with a large min-size (lots of cut-point-skipping speed improvement), and most people using simple chunkers are probably using settings that are sub-optimal for speed and deduplication.

@Glandos
Copy link

Glandos commented Oct 2, 2023

And there is not only big fat hardware. Here is the result with borg benchmark cpu on my venerable Intel(R) Atom(TM) CPU D2550 @ 1.86GHz which is still my main home server:

Chunkers =======================================================
buzhash,19,23,21,4095    1GB        7.019s
fixed,1048576            1GB        0.647s 
Non-cryptographic checksums / hashes ===========================
xxh64                    1GB        1.948s
crc32 (zlib)             1GB        1.775s                                                              
Cryptographic hashes / MACs ====================================
hmac-sha256              1GB        11.377s                                                             
blake2b-256              1GB        6.924s                                                              
Encryption =====================================================
aes-256-ctr-hmac-sha256  1GB        72.034s                                                             
aes-256-ctr-blake2b      1GB        73.775s                                                             
aes-256-ocb              1GB        51.841s                                                             
chacha20-poly1305        1GB        9.897s                                                              
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2                   5          1.760s                                                              
argon2                   5          2.626s                                                              
Compression ====================================================
lz4          0.1GB      0.082s
zstd,1       0.1GB      0.451s
zstd,3       0.1GB      0.712s
zstd,5       0.1GB      12.971s
zstd,10      0.1GB      15.028s
zstd,16      0.1GB      49.065s
zstd,22      0.1GB      84.937s
zlib,0       0.1GB      0.293s
zlib,6       0.1GB      12.918s
zlib,9       0.1GB      13.373s
lzma,0       0.1GB      103.726s
lzma,6       0.1GB      140.548s
lzma,9       0.1GB      114.000s
msgpack ========================================================
msgpack      100k Items 2.365s

So yes, very different result. Chaha20 is very fast. Blake2b is much faster than hmac-sh256, because I don't have HW acceleration. I would love seeing any performance increase in computation time in any of these categories :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants