Option to detect hash collision #170

Closed
pcgalando opened this issue Aug 29, 2015 · 53 comments

@pcgalando

Hash collisions are unlikely when using SHA256, but as the number of files/chunks increases, so does the probability. Most de-duplication systems do some sort of sanity check: file based - check file size and possibly type; block based - either generate a fingerprint in addition to the hash, or read the block already in archive storage (e.g. the paranoid mode option in Symantec PureDisk) to verify that the archived data really matches the new data before incrementing the link counter and discarding the new data. The functionality is easy to test if you are able to change the hash function to md4.

@ThomasWaldmann
Member

AFAIK, hash collisions start becoming a real problem when the number of hashes is about 2^(N/2), with N being the bit length of the hash. So for sha256 that is 2^128 hashes. Each chunk is about 2^16 bytes (default, or likely bigger if you backup a lot of data and adapt --chunker-params). So this means that we talk about 2^144 Bytes of data in your backup. In decimal, that is a 1 followed by 43 zeros.
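A quick back-of-the-envelope check of those numbers (an illustrative Python sketch, not borg code; it uses the 2^16-byte default chunk size mentioned above):

```python
# Rough birthday-bound check: how much data until collisions become plausible?
N_BITS = 256                    # output size of the id hash (SHA-256)
CHUNK_SIZE = 2 ** 16            # default target chunk size in bytes (at the time)

chunks_at_birthday_bound = 2 ** (N_BITS // 2)        # ~2^128 chunks
data_volume = chunks_at_birthday_bound * CHUNK_SIZE  # ~2^144 bytes

print(f"~{data_volume:.2e} bytes")   # ~2.23e+43, i.e. a 1 followed by ~43 zeros
```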

I guess you'll have a lot of other problems before that happens, right?
Like Borg eating all your RAM, all your disks failing at once, the ceiling falling on your head, the sun exploding, ... :-D

So the question here is really how much overhead one wants to add just to deal with such an improbable case.

Currently we can not easily change the id_hash, but I have plans to make that more flexible (together with more flexible crypto).

@ThomasWaldmann
Member

One check I could imagine (not sure if it is done already): if we detect that we already have some specific hash, additionally compare the length of the current input data to the length of the stored data that originally generated that hash. This would add a few bits (8..10?) of safety margin and should be implementable rather cheaply (and also without significant runtime overhead).

Relevant code: cache.Cache.seen_chunk and also .add_chunk. It does not compare the size yet.

BTW, if the chunk length matches, this also means either that this was a relatively small file and the file size matches (whole file = 1 chunk), or that the rolling hash used for the content-defined chunking triggered at the beginning and/or end of the chunk. This seems to add some more bits of safety beyond the mere length check.
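For illustration, a minimal sketch of such a size check (not borg's actual code; it assumes a chunks-cache mapping of id -> (refcount, size), which is a simplification):

```python
# Illustrative sketch: a seen_chunk-style lookup that also compares sizes.
class ChunkCacheSketch:
    def __init__(self):
        self.chunks = {}   # id (bytes) -> (refcount, size)

    def seen_chunk(self, id, size=None):
        refcount, stored_size = self.chunks.get(id, (0, None))
        if refcount and size is not None and size != stored_size:
            # same id, different length: corruption, a software bug, or a hash collision
            raise Exception("chunk %s exists with a different size, refusing to deduplicate" % id.hex())
        return refcount
```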

@ThomasWaldmann ThomasWaldmann added this to the 0.26 milestone Sep 5, 2015
@ThomasWaldmann ThomasWaldmann self-assigned this Sep 5, 2015
@ThomasWaldmann
Member

@pcgalando ^^^ see pull request.

Guess this is all we should do here.

ThomasWaldmann added a commit that referenced this issue Sep 6, 2015
detect inconsistency / corruption / hash collision, closes #170
edgewood pushed a commit to edgewood/borg that referenced this issue Sep 9, 2015
…#170

added a check that compares the size of the new chunk with the stored size of the
already existing chunk in storage that has the same id_hash value.
raise an exception if there is a size mismatch.

this could happen if:

- the stored size is somehow incorrect (corruption or software bug)
- we found a hash collision for the id_hash (for sha256, this is very unlikely)
@aljungberg

Just to rephrase what @ThomasWaldmann said in a slightly different way:

In 2014, the total amount of data produced in the world was 3.5 zettabytes (Whitby, Seagate). Even if you backed up every single byte of information humankind produced (hello NSA), it'd take 22300745198530623141535 years to have a 50% chance of a SHA256 collision.

Let's meet here again in a billion years and review if any changes are needed.

@RonnyPfannschmidt
Contributor

Given the exponential growth of things, a change of hash algorithm is potentially helpful in a few decades.

@enkore
Contributor

enkore commented Feb 23, 2017 via email

@ThomasWaldmann
Member

Quoting myself:
"Each chunk is about 2^16 bytes (default, or likely bigger if you backup a lot of data and adapt --chunker-params)."

Update:
That was the old "like attic" chunksize. Since borg 1.0, we target larger chunks of 2^21 bytes by default.

@sebres

sebres commented Dec 17, 2019

So this means that we talk about 2^144 Bytes of data in your backup.

No, this is not quite correct.
Many people misunderstand the birthday paradox when it is applied in a less generic sense. It says "the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2)".
Applied here, it tells us that 2^128 entries suffice to "ensure" a collision occurs with even chance; but it definitely leaves open the possibility that 2^127 or fewer could also work (with falling probability).

Also note that this is about an ideal, generic case...
But (there is always some "but"):

  1. Firstly, the "solution" of the birthday paradox is about generalized random numbers, which the chunks definitely are not.
  2. Secondly, it is more about a perfect hash (in the generic sense), and SHA-256 cannot be considered such a function, especially built over chunks of files with arbitrary content (see P.3) and "compressing" 16777216 bits (2^21 bytes, the chunk size) to 256 bits (the hash).
    You would not be able to provide a mathematical proof for that.
  3. Because the deduplication process works with chunks that may also be parts of text files (not necessarily binary data), you are actually working on a sub-set in the generic sense; for example, hashing over a data set of bytes (256 = 2^8 per token) produces an order of magnitude fewer collisions than hashing over some set of ascii chars (26+10+x per token), let alone unicode data (where often every 2nd byte is 0).

So the "quality" of a data set is often more important than its quantity.

Thus your supposed 2^128 can abruptly become 2^32 or even smaller.

As a counterexample to your theory, please see this article (more than 400 collisions were found for MM3-128 on 2 million paths).
Although MM3 is not a cryptographically strong hash, it has very similar distribution (and dispersion) quality to SHA.
Related to the above-mentioned problem, for N=128 we'd expect 2^64 entries in the generalized case. But what we observe in reality is that there are data sets where fewer than 2^18 unique entries (file paths) were needed to catch the first collision.
Hopefully I don't have to explain the difference (orders of magnitude) between 2^64 and 2^18?

So the question here is really how much overhead one wants to add just to deal with such an improbable case.

BTW, regarding "such an improbable case" - please see above.

Which overhead exactly do you mean?
As for the overhead to implement it - the algorithm is pretty simple: if you find a chunk with the same hash (and if the suggested option is set), simply retrieve the stored chunk(s) with that hash and compare their content with the new chunk (and if different, store it as a sequence / list). But that is an extremely rare case anyway (and per your "theory" even improbable ;).
And the overhead of the comparison should not be that large (the new chunk is still in memory, because you have just calculated its hash).
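For clarity, a sketch of what such a paranoid option could look like (hypothetical names, not borg's API; on an id match the stored chunk is fetched and byte-compared before it is reused):

```python
# Hypothetical "verify content on dedup" option, illustration only.
def add_chunk_paranoid(cache, repo, chunk_id, data):
    if cache.seen(chunk_id):
        stored = repo.fetch_plaintext(chunk_id)   # costly: transfer, decrypt, decompress
        if stored != data:
            raise Exception("id collision: same id, different content")
        return chunk_id                           # genuine duplicate, reuse it
    repo.store(chunk_id, data)
    cache.add(chunk_id)
    return chunk_id
```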

Since borg 1.0, we target larger chunks of 2^21 bytes by default.

You say that as if it were an advantage, whereas in the sense of "perfect" hashing over a variable data set of unknown quality it is really questionable (rather a disadvantage).
Note that the more interesting question is how many entries (and hashes) you'd have in some abstract "whole" set, if you build a non-perfect hash over some sub-set.

Anyway, I would like to ask: how can one completely disable the deduplication? (Because as long as the suggested option is not available, I don't see it as safe and cannot really use borg for backup purposes.)

@ThomasWaldmann
Member

ThomasWaldmann commented Dec 17, 2019

Borg stores data chunks into a key/value store using MAC(plaintext) as the key.
The MAC result is 256 bits. The key of that key/value store is 256 bits long (no more, no less).

So what is your suggestion for storage of plaintext that results in a MAC collision with a different plaintext?

@ThomasWaldmann ThomasWaldmann removed this from the 0.26 milestone Dec 17, 2019
@ThomasWaldmann
Member

a912c02

this was the changeset that closed this issue - it does an additional length check, comparing the length of the current chunk with hash X to the length of the stored chunk with hash X.

we can easily and efficiently do that, because we have hash and size in the local in-memory chunks cache and we do a lookup into that chunks cache anyway to detect whether we already have the data for hash X. a different length would mean we have a collision, but not vice versa, of course.

otoh, the suggestion by @sebres to compare the current chunk data to the stored chunk data would be very expensive, as the storage is often remote and the data would need transferring, authentication, decryption, decompression and byte-for-byte comparison. and this would not be a rare operation (only happening in the collision scenario); it would also be needed for the very frequent real-duplicate scenario - before you have done the comparison, you do not know whether it is a duplicate or a collision.

@sebres

sebres commented Dec 18, 2019

Borg stores data chunks into a key/value store using MAC(plaintext) as the key.

So what? I just wanted to point to the sub-set issue in the hashing... Through the "compression" of a large number of bits to a small number of bits (which is what hashing does), you restrict the whole data set (that the hash can be applied to) to a sub-set, related to the quality of the data set, its size and the hash size (no matter which function is used, MAC or whatever, as long as it is not a PHF).
If H(plaintext_i) == H(bin_k) == PHF(n), where n is in [1..2^256], i in [1..I], k in [1..K] and I < K (in fact extremely so, due to multiplication),
then the probability of a collision between any H1 and H2 over the set (I) is many, many times larger than for the same hash over the set (K), or for any two PHF(n).
The same logic applies as when calculating the variability of playing with only 5 cards versus 32 cards.

we have hash and size in the local in-memory chunks cache...

the storage is often remote and the data would need transferring, authentication, decryption, decompression and byte-for-byte comparison...

But do you not have the original chunks (as part of the file they were created from)?
If not, another option would be to send the chunk and do all this remotely.

If the encryption is deterministic, the same verification process is also possible with the encrypted/compressed data (no need to decompress/decrypt it).

@jdchristensen
Contributor

@sebres is wrong in his descriptions. Properties of the files have no effect on the chance of collision. And people are more generally misunderstanding how small the chance of collision is. Even with billions of chunks, the chance of collision is much, much, much smaller than the chance that an asteroid hits the earth and destroys all life in the next second. Borg does not need to worry about 256-bit hash collisions, and it would be a waste of effort to compare chunks against each other when the hashes match, something that happens very frequently when using borg.

@sebres

sebres commented Dec 18, 2019

Properties of the files have no effect on the chance of collision.

And you can surely explain (or better, provide a mathematical proof of) how, depending on dataset properties, the expected 2^64 turns into 2^18 in practice.

And people are more generally misunderstanding how small the chance of collision is.

Oh, really?
But this has already happened several times (intended and not). See here for example (sure, that was SHA1, but 160 bits, or 2^80 due to the birthday paradox, is also a very large number).
How so, given this microscopic chance?

@kroki

kroki commented Dec 19, 2019

All the comments above (and likely future comments below as well, as people never learn) against collision detection use a wrong assumption: if the probability is low, then it won't happen soon.

If you roll a die, the probability of "6" is less than the probability of "not 6". This doesn't mean that "6" won't happen soon, just that it is less likely than "not 6". Still, "6" can happen on the very first roll. Now, if you have a die with 2^256 sides, the mechanics are the same: it's highly unlikely, but it may happen nevertheless. The "die" does not know the results of previous rolls, nor is it anchored somehow in an abstract "today", so any "soon" and "distant future" are irrelevant.

So the real question is not whether it happens, but what to do when it happens. If one loses an HDD and restores all the data from Borg, some people may feel uneasy knowing that there's a non-zero chance that some of the restored files may be corrupt in a subtle way (like injection of random material inside critical data used to operate hazardous equipment, etc.), and there's no way to check that it didn't happen. Unless you took precautions, which users gathering their wisdom from incorrect answers on stackexchange do not take.

One possible workaround would be to checksum all files Borg is going to back up, and include such a manifest file in the backup. On restore, check the files manually. It is inherently racy, but so is the backup process itself. It's really Borg's job, but let's point a finger: Restic doesn't do it either!

To be completely fair, any checksum is probabilistic by nature due to the pigeonhole principle, so the probability of Borg with 256-bit keys injecting random data into a file is exactly the same as the probability of a SHA256 collision between different files. However, in the latter case the colliding file will unlikely be anything but random noise, while in the injection case it may still look structured and intact. Feel the difference.

Borg's slogan: may backup, may be secure until first deliberate attack, may even restore some files if all the meteorites are in their supposed trajectories. Roll the dice with us!

Please take it easy :)

@kroki

kroki commented Dec 19, 2019

exactly the same

Correction: I meant "the math is exactly the same", concrete values depend on number of chunks and files respectively. But the point is that probabilities are of roughly the same magnitude.

@sebres

sebres commented Dec 19, 2019

I meant "the math is exactly the same", concrete values depend on number of chunks and files respectively.

Well, not quite - the math is even worse here, because of the "compression" of a chunk to the hash in a non-perfect way.
This unfortunate gambling with the bits of a chunk, by mixing them into the small hash value, could have unpredictable effects if the calculation over some significant bits leads to a "loss" of important bit combinations in the resulting hash, so the whole result can collide with another hash over a totally different block.

Sure, the number 2^128 is enormous, but only if one really operates with 2^128 entries (e.g. in the case of a perfect hash function), and that is not true when hashing over a variable data set with SHA-like algorithms.

But you are right: for someone whose drive is currently lost and who discovers that some archive in the backup is "damaged" too, it does not really matter exactly how small the probability was.
Individually, for him, it was large enough.

@level323

(snip)

a912c02

this was the changeset that closed this issue - it does an additional length check, comparing the length of the current chunk with hash X to the length of the stored chunk with hash X.
(snip)

@sebres Disclaimer: I am also not a crypto expert. Regarding the above changeset mentioned by @ThomasWaldmann, doesn't this code mean that (perhaps excluding some even rarer edge cases) the only way borg will silently skip a new chunk with a hash that collides with an existing chunk is if-and-only-if the new chunk also happens to be exactly the same size as the existing chunk?

I've tried to understand the concerns you've been raising in this issue, and AFAICT so far you have been talking only about the probability of a hash collision. For your concerns to be more relevant to the borg code, wouldn't you also need to consider the chunk size checks that borg performs when it encounters a chunk hash that already exists?

@sebres

sebres commented Dec 19, 2019

I am also not a crypto expert.

The function used as a checksum, or as the key of a hash table, has basically nothing to do with cryptography (and also does not need to be cryptographically strong). The distribution and dispersion quality is more important here.

doesn't this code mean that ... the only way borg will silently skip a new chunk with a hash that collides with an existing chunk is if-and-only-if the new chunk also happens to be exactly the same size as the existing chunk?

Sure, and it would decrease the probability of a collision (by about 2^(21/2), so ca. 1448 times) for files of variable size smaller than 2^21.
But it doesn't matter at all for a set of files of equal size, or for any chunks of files larger than 2^21.

@enkore
Contributor

enkore commented Dec 19, 2019

I logged into Github and went through 2FA to reply to this thread, typed a long reply, read it, and discarded it. I can't help but sum up my experience of reading your points with a meme.


You seem to imply that a feasible way to generate collisions for HMAC-SHA-256 exists, and also a way to generate classes of inputs to HMAC-SHA-2 such that a large number of output bits can be controlled without knowledge of the key (which at this time is not feasible even for measly HMAC-MD5, and also not for HMAC-SHA1, despite SHA1 being shattered; that's because HMAC doesn't rely on preimage resistance or collision resistance to uphold its security). You say using a cryptographic hash for chunk IDs is unnecessary; but that is actually a paramount part of the design.

By the way, a handy approximation for the probability of at least one random collision of a PRF with N output values and k inputs is k²/(2N) (for large k, N). For example, 1e12² / 2**257 is ~4e-54.
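The same approximation in code (illustrative sketch):

```python
# Birthday approximation: P(collision) ~ k^2 / (2N) for k inputs and N possible outputs.
def approx_collision_probability(k, n_bits=256):
    return k * k / float(2 ** (n_bits + 1))   # k^2 / (2 * 2^n_bits)

print(approx_collision_probability(10 ** 12))  # ~4.3e-54, matching the example above
```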

@sebres

sebres commented Dec 19, 2019

Just as an explanation of the coming question "why Inf?":
Simply because I cannot currently calculate this number, but with a maximal chunk size of 2^21 bytes, it would be the factorial of 2^(8*(2^21)), so the same as:

  • (2^16777216)!, or even
  • (256^2097152)!

@enkore
Contributor

enkore commented Dec 19, 2019

The question whether you get collisions hashing every possible chunk (yes, of course) is not relevant. The question whether it is possible to get a collision is not interesting, because obviously you can. The question of interest is how high the probability is. This question has been answered in detail in this issue and the mailing list several times now.

@sebres

sebres commented Dec 19, 2019

The number of all chunks is about Σ2^n for n = 0 to 2e8, which is 2^(2e8+1)-1 ~ 10^10^8.

OK, back to the school!
See my "explanation" above.

@enkore
Contributor

enkore commented Dec 19, 2019

Care to elaborate where that factorial is coming from? Remember, we are simply counting all possible chunks [1]. And yes, that formula is obviously not correct, though the error is quite marginal.

[1] 256 1 byte chunks + 65536 2 byte chunks + 16777216 3 byte chunks + ... + 256^2e7 20 MiB chunks
i.e. Σ256^n for n = 1 to 2e7; about 256^2e7, about 10^10^8.

@kroki

kroki commented Dec 19, 2019

@enkore

MAC(ciphertext) and MAC(plaintext) are functionally different: In the former case, using different compression algorithms gives you different chunk IDs...

...if you use MAC(ciphertext) as a key, something that I never implied in my question. You said that MAC(plaintext) is an essential part. I ask why. Why can't we use just any HASH(plaintext) for the key? Why do we need authentication here?

It is also not guaranteed that a given plaintext only generates a single ciphertext

It's absolutely not clear what you mean. I believe there's one-to-one correspondence between plaintext block of 256 bits and its ciphertext block of 256 bits (given fixed key and Nonce/IV of course), otherwise it's not clear how decryption would work.

which is actually caused by a design problem (see docs)

I have read all the docs just recently, but somehow missed this part. Could you please provide a more specific reference? Where exactly should I look?

In Borg's case, each plaintext (after compression) generates the same ciphertext

For fixed key and Nonce - sure. For different Nonce - different ciphertext. Do you expect it some other way?

The ciphertext is already authenticated before we get to verify the ID over the plaintext; this prevents attacks against the decompressor. The IDs are referenced by other chunks, which are authenticated the same way, up to the manifest, which is MAC'd with a separate key. The docs have more details.

But how does this protect against a hypothetical chunk permutation bug in Borg? What I see is that individual chunks are authenticated (both before and after decryption), and that the list of the supposed order of chunks is also authenticated. What I don't see is any check (cryptographic or not) that Borg actually did assemble the chunks in the required order (which would be the essence of the hypothesized bug). I think a checksum of the whole file would provide such a check and also catch the case when an event of extremely low but non-zero probability actually realizes itself (I also fail to see why the exact probability computation is important as long as it is established that it is greater than zero, but if you people are having fun, then why not?).

But let's do it one step at a time: why can't we use just any HASH(plaintext) for the key? Why do we need it to be a MAC?

@sebres

sebres commented Dec 19, 2019

OK, let us forget the permutations of N distinct objects here and change the factorial to a sum; it would simply change nothing, because it is not important at all - 10^10^8 is the same as 10^100000000. This number has 100M zeros, and although it is still not comparable to a googolplex, it is many, many times larger than a googol.
Just as a reminder - a Googol is 10^100 and looks like:

10,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

So we are speaking about Googol^1000000.

Just repeat the zeros of the above-mentioned number 1 million times.

And now compare it with 2^256, which is:

115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936

then you can estimate how many chunks with collisions theoretically exist when hashing with a 256-bit hash.
It is pretty clear that these colliding chunks are "distributed" over all known "universes", but because the number of possible collisions is so huge, you can never exclude that some two chunks of this sequence belong (or will belong in the future) to some borg backup.

As long as you cannot do that, the decision for this issue should be obvious.

@enkore
Contributor

enkore commented Dec 19, 2019

...if you use MAC(ciphertext) as a key, something that I never implied in my question. You said that MAC(plaintext) is an essential part.

Ah, I see. Borg is designed to be deduplicating and assumes f(chunk) == f(chunk') => chunk == chunk'. Using this as a key is the reason it is there. That's why it needs to be a cryptographic hash (to avoid your backups being malleable by having collisions in that particular f); not necessarily a MAC (more below).

This immediately implies that the system is probabilistic -- at least on paper -- since the domain of f() is much larger than its image.

Why can't we use just any HASH(plaintext) for the key? Why do we need authentication here?

It's not used for authentication; if you replaced HMAC() with HASH() (plain SHA-2, say) in an encrypted repository it would still be secure (because the manifest is HMAC'd, and contains the hashes of the archives and so on).
However, if you use HASH(plaintext), you are telling the remote server a plain hash of all chunks; for small files you are thus just telling the server the hash of files you have; and for larger files it is still trivial to test for their presence in your backup. That's not a nice property for a supposedly encrypted backup.
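A minimal sketch of that difference (illustrative only; the key name and helper functions are assumptions, not borg's actual code):

```python
import hashlib
import hmac

# With a plain hash, the server can test whether you store a known chunk/file.
def plain_id(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

# With a keyed MAC, it cannot, because it does not know id_key.
def keyed_id(id_key: bytes, chunk: bytes) -> bytes:
    return hmac.new(id_key, chunk, hashlib.sha256).digest()
```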

It's absolutely not clear what you mean. I believe there's one-to-one correspondence between plaintext block of 256 bits and its ciphertext block of 256 bits (given fixed key and Nonce/IV of course), otherwise it's not clear how decryption would work.

For a block cipher you're absolutely correct. But when we are talking about a plaintext or a ciphertext in this context, we (I) didn't mean a single block, but rather the entire chunk, up to ~20 MiB in length. In Borg's case AES-CTR is used to encrypt that, using a global counter as a nonce (so it actually isn't deterministic w.r.t. to the plaintext). A different scheme might use a random nonce; every time you'd encrypt the same plaintext, you'd get a different ciphertext. Yet another scheme might be fully deterministic, same plaintext results in always the same ciphertext.

For fixed key and Nonce - sure. For different Nonce - different ciphertext.

Yup.

But how does this protect against a hypothetical chunk permutation bug in Borg? What I see is that individual chunks are authenticated (both before and after decryption), and that the list of the supposed order of chunks is also authenticated. What I don't see is any check (cryptographic or not) that Borg actually did assemble the chunks in the required order (which would be the essence of the hypothesized bug).

Changing the ordering of chunks in a file would be a very specific attack/change (e.g. from a file consisting of chunks 1, 2, 3 to it consisting of chunks 1, 3, 2 ; each number representing a different (HMAC-)SHA-256 value).

The easier to think about attack is to adversarially generate a chunk that has the same ID as another chunk. E.g. replacing the contents of /bin/sudo with a backdoored version.

I think checksum of the whole file would provide such a check and also catch the case when an event of extremely low but non-zero probability actually realizes itself.

Yes, storing a separate hash of the entire file contents would be possible for Borg to do. However, verifying H(H(part1) || H(part2) || ...) implies the same confidence as verifying H(part1 || part2 || ...) for all practical intents and purposes, unless the particular H used is broken.

@kroki

kroki commented Dec 19, 2019

factorial of 2^(8*(2^21))

I too can't follow the factorial part (yep, I didn't pay enough attention at school), however even if we consider just the default minimal chunk size of 512KiB = 524288 bytes = 4194304 bits, there are 2^4194304 different 512KiB chunks that map to 2^256 different keys. So each key has at least 2^(4194304 - 256) chunks targeting it. Like with pages of Tolstoy's "War and Peace", most of them are just noise, but some do have meaning, and to my understanding nothing in the Universe prevents them from appearing in a single backup.
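The counting itself is easy to sanity-check (illustrative Python; it only counts digits rather than materializing the huge number):

```python
import math

# Pigeonhole check for 512 KiB chunks vs. 256-bit chunk ids.
chunk_bits = 512 * 1024 * 8          # 4_194_304 bits per 512 KiB chunk
id_bits = 256

# Each id value has at least 2^(chunk_bits - id_bits) possible preimage chunks.
digits = int((chunk_bits - id_bits) * math.log10(2)) + 1
print(f"2^{chunk_bits - id_bits} preimages per id, a number with {digits:,} digits")
```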

That's why it needs to be a cryptographic hash (to avoid your backups being malleable by having collisions in that particular f)

I have to refer you to docs:

The attack model of Borg is that the environment of the client process (e.g. borg create) is trusted

So no clients trying to exploit hash malleability.

It's not used for authentication

Well, actually it is used for authentication (see the two instances of CONSTANT-TIME-COMPARISON() in the Decrypt section - since Borg has it, it would be a waste not to use it).

However, if you use HASH(plaintext), you are telling the remote server a plain hash of all chunks

Doesn't Borg encrypt it? IIRC MAC(plaintext) is inside ciphertext (and if not it should - some security experts believe that MAC(plaintext) may reveal some information about plaintext - or so I heard).

Borg's case AES-CTR is used to encrypt that, using a global counter as a nonce (so it actually isn't deterministic w.r.t. to the plaintext)

I'd rather say that it is a deterministic process parametrized by encryption key and Nonce/IV. It simply couldn't be otherwise.

Yet another scheme might be fully deterministic, same plaintext results in always the same ciphertext.

This is simply impossible with CTR mode (Nonce reuse flaw), and highly undesirable in any other mode, so I don't view it as a possibility. The very purpose of Nonce/IV is exactly to avoid this as much as possible.

Changing the ordering of chunks in a file would be a very specific attack/change

I was talking about a hypothetical bug; you again read more into it than there was.

However, verifying H(H(part1) || H(part2) || ...) implies the same confidence as verifying H(part1 || part2 || ...) for all practical intents and purposes, unless the particular H used is broken.

...or Borg has a bug. However I don't agree with the claim "for all practical intents and purposes", and haven't seen much of a justification from your exchange with Sebres (or I wasn't able to comprehend it), so we are back to where we started.

Thanks for elaborations! :)

@enkore
Contributor

enkore commented Dec 19, 2019

The attack model of Borg is that the environment of the client process (e.g. borg create) is trusted

So no clients trying to exploit hash malleability.

The execution environment is trusted, but we do assume that the attacker is able to place arbitrary files in your backup input set, because that is a very common and easy to achieve capability (e.g. web server with a file upload form somewhere).

Doesn't Borg encrypt it? IIRC MAC(plaintext) is inside ciphertext (and if not it should - some security experts believe that MAC(plaintext) may reveal some information about plaintext - or so I heard).

No, MAC(plaintext) is the "public" ID of the object, that will eventually end up at Repository.put.

Yes, a MAC in general may reveal information about its input. HMAC provides guarantees beyond those required of a MAC; it's a PRF (pseudo-random function; making some assumptions about the underlying hash function) and does not leak anything about the input to the output.

This is simply impossible with CTR mode (Nonce reuse flaw), and highly undesirable in any other mode, so I don't view it as a possibility. The very purpose of Nonce/IV is exactly to avoid this as much as possible.

The immediate purpose of not reusing a nonce/IV is to use different keystreams for different messages (in the case of block-based streamciphers like AES-CTR or ChaCha20; similar idea for other modes). As you noted, different modes react differently to nonce/IV reuse. It is not outright trivial but not particularly hard to create deterministic schemes such as AES-SIV. One concern with these is that, since ciphertext is strictly a function of key and plaintext, encrypting the same plaintext multiple times reveals that the ciphertexts belong to the same plaintext (this is called IND-CPA, ciphertext indistinguishability; no deterministic scheme can provide IND-CPA). This is not an issue for Borg, since in that case the IDs would be the same anyway, which we tell the server openly. And also, because of deduplication, we wouldn't generally encrypt the same plaintext twice (except maybe around borg --repair etc.).

AES-SIV btw. uses CTR mode internally; the CTR nonce is essentially the MAC of the plaintext.

However I don't agree with the claim "for all practical intents and purposes", and haven't seen much of a justification from your exchange with Sebres (or I wasn't able to comprehend it), so we are back to where we started.

It pretty much comes down to what we expect of a cryptographic hash: Given an unbroken cryptographic hash function H and two messages msg1, msg2, then H(msg1) == H(msg2) with msg1 != msg2 is improbable. Borg (and pretty much every content-addressed storage system, of which there are many, generally used for large amounts of data) thus assumes H(msg1) == H(msg2) <=> msg1 == msg2. This is a common assumption; digital signatures and much of cryptography by-and-large rely on this assertion.
That's why people are highly motivated to find attacks against H()'s, and to get people to use H()'s that are considered secure.
Fortunately Borg uses HMAC-SHA-2 where it matters; the HMAC construction has been shown to be far more resilient than the underlying hash function.

@sebres

sebres commented Dec 19, 2019

Like with pages of Tolstoy's "War and Peace", most of them are just noise

I expressly contradict this sentence - I have read it in the original and it is a great and even vintage classic from first to last byte. :)

As for the relation N(chunks) vs. N(hashes)... the adherents of "use the hash as a unique ID without data comparison" (so H(msg1) == H(msg2) <=> msg1 == msg2) will tell you with pleasure how small the chance of a collision is, compare it with an asteroid crashing into the earth, etc. pp. And they always confuse the usage of a hash in the cryptographic sense (signing, encrypting, etc.) with its usage as a checksum or as the key of a table.
But every single time they consider only one side of the coin.

@enkore
Contributor

enkore commented Dec 19, 2019

Sadly, no designer of any content-addressed storage system ever stopped to consult the Senior Expert Software Developer :)

Edit: While the discussion with kroki is interesting (though perhaps better suited for the mailing list), the original issue #170 had been discussed to death even before the current iteration of this ticket, and no merit will be found in beating a dead horse (or entertaining an ob(li)vious troll) any further.

[Shouldn't you be using svnhub instead? 🤔]

@sebres

sebres commented Dec 19, 2019

Borg (and pretty much every content-addressed storage system, of which there are many, generally used for large amounts of data) thus assumes H(msg1) == H(msg2) <=> msg1 == msg2.

Name even one that operates on backup data.

If you mean git and co: then it is pretty simple - yes, the assumption is there, but a collision is simply not important:

If two revisions in git have the same hash, it will treat those commits as identical.
But in the unlikely case this really happens, you could always amend (go back one commit and change something like a timestamp) so they no longer collide. Or even rebase the branch, etc.

In the case of a broken backup you have no chance to repair it without the original data (which you are missing if you currently need a restore).

@enkore
Contributor

enkore commented Dec 20, 2019

Name even one that operates on backup data.

Oh, I don't know: Centera, Venti, IBM Tivoli, Acronis, Veeam (how do you think client-side dedup works to reduce traffic?), duplicati, bup, restic, Tahoe-LAFS, tarsnap, IPFS... just off the top of my head.

@sebres

sebres commented Dec 20, 2019

how do you think client-side dedup works to reduce traffic

Oh, come on.
I'm not interested in what you think, or in what you know.

As for the many proprietary products on that list - do you really want to tell me you know their internal architecture? OK.
BTW, I was speaking with a well-known engineer who worked for EMC at some point; he told me "No, not that I know of". So at least regarding Centera you seem to lie.
Another one is from NetApp (I still cannot reach him).

@sebres

sebres commented Dec 20, 2019

Apropos EMC: one of my customers uses a Centera cluster as storage for his bank's core business, archiving billions of documents (transactions and whatnot) per year with a retention time of up to 30 years.
Do you really want to tell me that the bank (or rather its data center) would accept a storage system that depends on probability, i.e. with some (however microscopic, but unknown) chance of losing a document? At the very least, this could never get through the certification process.

@enkore
Contributor

enkore commented Dec 20, 2019

I find it plainly amusing that you are completely oblivious to the BER of processors, memory, storage and interconnects, not to speak of random and correlated failure rates of entire components. But hey, it's digital, right? It's always correct; neither random nor systematic failures occur.

As for the many proprietary products on that list - do you really want to tell me you know their internal architecture? OK.

There's that thing called "documentation". Vendors tend to explain how their software does things, because gasp that's relevant to applying the software.

BTW, I was speaking with a well-known engineer who worked for EMC at some point; he told me "No, not that I know of". So at least regarding Centera you seem to lie.

Imagine trying to call someone a liar for saying Centera is a CAS... literally the headline feature of the entire product.


This will be my last reply to you; further attempts at trolling will not be met with a response.

@sebres

sebres commented Dec 21, 2019

OK, I'll try to explain it with a combinatorial example (because I assume my math would not be understood again)...

Firstly, I am definitely right with the implication that the total number of all possible chunk variants (in all known "universes") affects the calculation, as well as with the estimation of its ultimate impact on the probability calculation.

So the enormous ratio of Googol^1000000 to 2^256 has a huge negative impact on the chance of a collision (and significantly enlarges the probability), see the formulas below.

Likewise, the properties of the chunks/files have an effect too (in fact a drastic one, which I can also explain later).
To assess these questions my college combinatorics is still good enough.

The question whether you get collisions hashing every possible chunk (yes, of course) is not relevant.

Still again - wrong. The number of all possible chunks (yes, of course) is relevant in the probability calculation (see below).

a handy approximation for the probability of at least one random collision of a PRF with N output values and k inputs is k²/(2N) (for large k, N). For example, 1e12² / 2**257 is ~4e-54.

Your approximation is wrong too - simply because the number of all possible chunks is relevant.

To understand the correct estimation (or rather all the dependencies here)... we can use the following approximation:

the task is similar to the probability of finding two cards of the same type in some small subset of cards taken from the whole deck. Two different cards of the same type means a collision is found.


Let's play cards (we start with 32 classic cards):

  • the deck contains K (32) cards (in our model, big K equals the huge number of all possible chunk variants of some fixed size everywhere, not just in the backup);
  • we take k (3) cards from the deck (in our model, small k equals the number of input chunks to back up);
  • we play with classic cards, with at most n (4) cards of every type;
  • so N (32/4 = 8) is the number of possible types (in our model, big N equals the number of hash values, 2^256);

The probability of finding two cards of the same type (i.e. catching a collision) among some 3 (k) cards out of 32 (K) is exactly:
(1*n-1)/(K-1) + ((K-n)/(K-1) * (2*n-2)/(K-2)) =
(1*4-1)/(32-1) + ((32-4)/(32-1) * (2*4-2)/(32-2)) =
0.277

The explanation is simple; at draw number...

  1. with a single card we cannot have a collision, but this 1 card "blocks" 4 (n) cards as a future "colliding" type.
  2. with a second card we either have a collision (same type as the 1st card), or, if not, we "reserve" the next 4 (n) cards and get a chance to catch a collision at the next draw.
  3. with a third card we have a collision if we match the type of the 1st or the 2nd card.

The formula also grows recursively (over k iterations) due to the probability graph (tree), but this dependency should already be pretty clear to everyone.

If one takes an attentive look at the formula, one sees that the result grows as K grows.
Although in the normal case small n would remain 4 (classic cards), in our case it also grows, because big N is constant (our sieve by hashing remains constantly 2^256), so the count of variants n = K/N grows together with K (and k).

This is pretty well visible in the following tables:

k    N       K     n    P(collision)
3    1..8    32    4    0.277
3    1..8    64    8    0.312
3    1..8    128   16   0.328

k    N       K        n    P(collision)
30   1..800  3200     4    0.339
30   1..800  32000    40   0.415
30   1..800  320000   400  0.423
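For reference, a direct transcription of the 3-card formula above reproduces the first table (this only restates the poster's model in code, it is not an endorsement of applying it to 256-bit hashes):

```python
# Probability of two cards of the same type among k = 3 cards drawn from a deck
# of K cards containing n copies of each type (the formula given above).
def p_collision_3_cards(K, n):
    return (n - 1) / (K - 1) + ((K - n) / (K - 1)) * (2 * n - 2) / (K - 2)

print(round(p_collision_3_cards(32, 4), 3))    # 0.277
print(round(p_collision_3_cards(64, 8), 3))    # 0.312
print(round(p_collision_3_cards(128, 16), 3))  # 0.328
```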

Thus, with a constant N (hash size), every additional byte of max chunk size grows the set K (all possible chunks), and even without growing the sub-set k (the input chunks in the backup) the probability increases significantly, because K grows exponentially (for example, at 2^21 bytes the value of K has already reached Googol^1000000).
Moreover, the impact increases further, non-linearly, with a growing number of input chunks k in the backup.

The numbers speak for themselves and I can only repeat my statement:

Growing the max chunk size this way is questionable ("since borg 1.0, we target larger chunks of 2^21 bytes by default"); moreover, it is very, very dangerous (you enlarge the probability of a collision with every new byte of max chunk size).

I cannot provide the exact estimation of how large it can be right now (one would need a supercomputer, or I should possibly try to calculate it with CUDA), but I hope you recognize the tendency and now understand the seriousness of the issue.

For people who don't believe the formulas, I wrote a program estimating the probability (iteratively) as well as using a recursive calculation (probability tree), which confirms the numbers (I'll provide the link as soon as I have pushed it to GH).


And if someone calls me a troll one more time, I'll find him :]

@sebres

sebres commented Dec 21, 2019

Here are the 2 program variants (in tcl and also in python) - https://github.com/sebres/misc/tree/id-hash-collision
See the end of README.md for usage examples.
And of course, don't try to calculate / estimate the probability for the above-mentioned huge numbers; it is doomed to fail.

@kroki

kroki commented Dec 21, 2019

FYI, in #4884 I tried to summarize the main points of this thread, with the conclusion that only a "weak" solution is possible and that it's up to the proponents of collision awareness to provide a patch. I'm not a Borg user and I won't volunteer, so let me unsubscribe :).

@sebres

I expressly contradict this sentence - I have read it in the original and it is a great and even vintage classic from first to last byte. :)

I'm ashamed to admit that it is still in my "to read" list.

@enkore

The execution environment is trusted, but we do assume that the attacker is able to place arbitrary files in your backup input set
...
AES-SIV btw. uses CTR mode internally; the CTR nonce is essentially the MAC of the plaintext.

Two valid points indeed.

It pretty much comes down to what we expect of a cryptographic hash: Given an unbroken cryptographic hash function

You have wrapped your mind around "attack" scenarios so tightly that you do not see any other possibilities. There are accidental collisions in the world, without any regard to cryptographic hash properties. Just don't forget to keep that in mind ;).

@kroki

kroki commented Dec 21, 2019

Said I'm gone - and back again!

Just another point for @enkore:

Borg (and pretty much every content-addressed storage system, of which there are many, generally used for large amounts of data) thus assumes H(msg1) == H(msg2) <=> msg1 == msg2. This is a common assumption; digital signatures and much of cryptography by-and-large rely on this assertion.

True for some content-addressed storage systems (I believe there are some that do check bit-by-bit, and there are some that at least store a full-file checksum). However, your generalization about encryption is not correct: the goal of cryptography is to protect against deliberate collisions. It does not, nor can it really (leaving aside quantum encryption), protect against me guessing your key by accident. If by accident it so happens that you sign some document with your signature, and I sign mine with mine, and the signatures collide bit-by-bit, it's not a security issue. We wouldn't even know that they had collided. Unless we try to put both documents into one content-addressed storage keyed by signatures, and someone tries to deliberately exploit the fact.

I have the impression that your strong subconscious objection against everything Sebres says is that you perceive his claims to imply that it is possible to deliberately find collisions, and that he knows how to do it. I didn't follow everything he said, but I'm sure it's not his claim (though it is true that there are ways to reduce the search space when you are deliberately trying to create a collision; Google used these in their "shattered" PDFs).

Now I said all I could, gone :).

@sebres

sebres commented Dec 21, 2019

I also added my summary at #4884 (comment).
I would be interested in any reasoned opinion about a possible conclusion.
In my opinion, the maximal size of chunks should be drastically reduced (but one could possibly optimize the accumulating algorithm by combining or chaining the chunk blocks like a C-trie, a DAG, or some finite state machine like Aho–Corasick, in order to store a common ID per chain-block instead of the hash of every single chunk).
And the issue should be reopened, in order to provide an option to disable deduplication, or to implement verification of chunk content together with the hash, for people who prefer guaranteed backup safety over performance or storage space.

@ThomasWaldmann
Member

Just to record an idea that came to my mind. It solves the storage part of the collision problem.

For efficiency reasons, we could apply it only to efficiently detected collision cases (EDCC).

EDCC:  H(d) == H(d') and len(d) != len(d')

(d is the current to-be-stored data and d' is some data already stored in the repo. the different length is required to efficiently differentiate a collision from data deduplication, see: a912c02 )

We currently just raise an "oh, we're fscked!" fatal exception in that case (and, AFAIK, nobody has seen it yet).

Instead of that exception, we could do this:

Currently:
we store d (plus compression/encryption/authentication, omitted here)

Instead, we could:
store d+x and len(d) (or even len(x), for brevity)

x would usually be nothing (a zero-length byte sequence), but the code would never require that.

In the EDCC case, x would be chosen as some generated non-empty byte sequence, until there is no hash collision any more (neither with d' nor with any other stored data d''). We neither want to generate d+x such that it results in a real duplicate, nor such that it produces another accidental hash collision.

Then we would need to store d+x together with len(d) (or len(x)).

The storage layer would handle that transparently, adding x when storing and removing it when fetching.

This should behave well until too much of the 2^256 key space is occupied (assuming H spreads well enough over that space).
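A rough sketch of how a storage layer could handle this transparently (illustrative only; the repo.get/put interface and names are assumptions, not borg code; equal-length collisions remain undetectable here, matching the EDCC limitation above):

```python
# Sketch of the d+x idea: pad colliding data with a deterministic suffix until
# its id no longer collides; store len(d) so fetching can strip the pad again.
def put_chunk(repo, id_hash, d):
    x = b""
    while True:
        key = id_hash(d + x)
        stored = repo.get(key)            # -> (orig_len, payload) or None
        if stored is None:
            repo.put(key, (len(d), d + x))
            return key
        if stored[0] == len(d):           # same id, same length: treated as a duplicate
            return key
        x += b"\x00"                      # EDCC: same id, different length -> retry

def get_chunk(repo, key):
    orig_len, payload = repo.get(key)
    return payload[:orig_len]             # strip the pad transparently
```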

@sebres

sebres commented Dec 21, 2019

@kroki

I didn't follow everything he said, but I'm sure it's not his claim

If you mean the math - you can simply download my py or tcl program (playing cards ;) and try it yourself with different numbers for K/k and N (don't try toooo large numbers ;).
You'd notice the impact of K on the resulting probability.
If you mean the approximation model itself (how and whether the chance of getting two cards of the same type is applicable to hash collisions), just ask. I'd be happy to answer any of your questions.

@ThomasWaldmann
Member

#170 (comment) adding to that:

The sequence of x values tried for collision handling of some problematic data d needs to be reproducible, because the same collision that happened for some backup might also happen again for all subsequent backups (if the problematic data d is still there). If x were random, we would produce a different d+x for each backup.

@sebres

sebres commented Dec 21, 2019

AFAIK, nobody has seen it yet

Sure, because en masse the spreading by the hash sieve affects chunks of equal size more (there are many, many times more colliding blocks among large chunks than among short chunks).
And because, very probably (I don't have statistics yet, but the math says so), far fewer large chunks ever get "deduplicated" in relation to short files.

Instead, we could:
store d+x and len(d) (or even len(x), for brevity)

As already said, it would marginally decrease the collision chance (see my last sentence in #170 (comment)),
but it doesn't matter at all for a set of files of equal size, or for any chunks of files larger than 2^21.

assuming H spreads well enough over that space

But unfortunately it does not (like every other adder+modulo hashing algorithm in the world).

@ThomasWaldmann
Member

ThomasWaldmann commented Dec 21, 2019

@sebres it sounds like you assume that the chunker cuts equal sized chunks if the file size is > 2^21. That is not the case, the chunking points depend on the file contents.

There is a minimum and maximum chunk size though, for technical reasons (storage efficiency, buffer size).
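For illustration, a toy content-defined chunker (not borg's buzhash implementation; the running hash is a stand-in for a real rolling hash, and the size parameters are merely chosen to match the 2^21 target mentioned above): cut points depend on the bytes seen, bounded by a minimum and maximum size.

```python
# Toy content-defined chunking: emit a boundary when the low bits of a running
# hash hit zero, but never below min_size or above max_size.
def chunk_boundaries(data, mask_bits=21, min_size=2**19, max_size=2**23):
    mask = (1 << mask_bits) - 1
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h * 31) + byte) & 0xFFFFFFFF   # stand-in for a windowed rolling hash
        size = i + 1 - start
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            yield (start, i + 1)
            start, h = i + 1, 0
    if start < len(data):
        yield (start, len(data))
```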

@ThomasWaldmann
Member

And "well enough" meant "well enough for our purposes" and the purpose is storing way less chunks than 2^256.

@ThomasWaldmann
Member

If my idea sounds valid / doable, I would open a ticket for "solving storage of detected hash collisions", meaning the EDCCs.

@sebres

sebres commented Dec 21, 2019

There is a minimum and maximum chunk size though, for technical reasons (storage efficiency, buffer size).

This does not really matter, because the max size alone makes up half of all possible blocks ever (5e10^8 of 10e10^8).

and the purpose is storing way less chunks than 2^256.

But as I already revealed above - the max chunk size, and thus the number of possible chunks (big K), as well as the quality of the dataset, are also important, especially considering the constantly growing number of stored chunks (small k), which significantly increases the impact of big K on the chance of a collision with every iteration over the probability tree:

id-hash-colprob 32000    30 800
k = 30 cards from K = 32000 max, by sieve N = 1..800, repeated n = 40 times:
Calc-P(collision)=0.41547846266991934

id-hash-colprob 32000    40 800
k = 40 cards from K = 32000 max, by sieve N = 1..800, repeated n = 40 times:
Calc-P(collision)=0.6198046446107234

id-hash-colprob 320000   30 800
k = 30 cards from K = 320000 max, by sieve N = 1..800, repeated n = 400 times:
Calc-P(collision)=0.42258853962267906

id-hash-colprob 320000   40 800
k = 40 cards from K = 320000 max, by sieve N = 1..800, repeated n = 400 times:
Calc-P(collision)=0.6280580556352908

id-hash-colprob 32000000 30 800
k = 30 cards from K = 32000000 max, by sieve N = 1..800, repeated n = 40000 times:
Calc-P(collision)=0.42336511057744053

id-hash-colprob 32000000 40 800
k = 40 cards from K = 32000000 max, by sieve N = 1..800, repeated n = 40000 times:
Calc-P(collision)=0.6289545531811702

Accumulated into a table, the effect of increasing ALL possible chunks of a certain size (K), together with growing input chunks (k), at constant hash size (big N = 800, small n varies as K/N), is pretty obvious:

k    K         n      P(collision)
30   32000     40     0.41548
40   32000     40     0.61980
30   320000    400    0.42259
40   320000    400    0.62806
30   32000000  40000  0.42336
40   32000000  40000  0.62895

Note that at max size 2^21 the number big K is enormous (5e10000000), and therefore n = K/N is extremely huge too - it is about 256^2097152 / 2^256 == 2^(16777216-256), which also corresponds to a number with 10M zeros.

So "well enough" rather means "well enough so far".

it sounds like you assume that the chunker cuts equal sized chunks if the file size is > 2^21

More interesting is the question of what happens with two large files containing different content but producing the same hash Ha for the first 2^21-byte part, while the hashes Hb and Hc for the second part(s) are different.
So for the first file we have Ha1|Hb and for the second file Ha2|Hc.
Would borg assume that, since Ha1 == Ha2, this first part is the same chunk (and as a result store it only once)?
