CRC-32 used as undocumented default #57

Closed · todd-a-jacobs opened this issue May 22, 2023 · 10 comments

@todd-a-jacobs

I have a job running using the following syntax:

zpaqfranz a foo.zpaq foo -m5 -verbose -xxh3 -pakka

The running job is reporting:

Integrity check type: XXH3+CRC-32

The use of CRC-32 isn't specified on the command line, and seems to occur whether or not I specify a specific xxhash or chunked format. For example, leaving off -xxh3 -pakka just results in the output line changing to:

Integrity check type: XXHASH64+CRC-32

instead, which is not what the documentation seems to define as the default either. I can see why -xxhash would default to xxhash64 on a 64-bit system, but I'm not sure why CRC-32 is being calculated or why it is a default, especially on a 64-bit system where 32 bits would seem to invite collisions.

If you just want a fast default to add, why not use MD5, which (while cryptographically weak) is at least 128 bits? This seems like either an error in the documentation, an error in the defaults, or a sub-optimal choice for a fast and well-supported checksum.

@fcorbelli (Owner)

One of the main differences between zpaqfranz and zpaq is the existence of whole-file checksums (in fact there are even bigger differences in the new tar-like format, still to be completed).
In zpaq there is no checksum (or hash) of a file, only the SHA-1 of its individual fragments:
https://encode.su/threads/3508-Compute-overall-SHA1-from-a-SHA1-series-of-fragments

Sadly this means that SHA-1 collisions (there are famous PDF files in this respect) are NOT intercepted by zpaq.
Short version: if you archive two files with a SHA-1 collision, zpaq doesn't complain.
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable/page2?highlight=sha1+collision
I can give a deeper explanation if you want.

CRC-32 has one major difference from 'cryptographic' hashes (including MD5): it can be computed over out-of-order and combined portions (aka fragments).
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable?highlight=sha1+collision
Again, you will find all the details on the development forum, or I can write them here.
The short version is that zpaqfranz calculates the CRC-32 during the test phase (the t command, test), with minimal performance impact, and compares it with the CRC-32 calculated during the compression phase (in short, the one derived from the file as it was read).
https://encode.su/threads/3543-How-to-quickly-compute-CRC-32-of-an-all-zeroed-buffer
You really cannot do this "thing" with deduplication on using a different hasher (MD5 or whatever); that's why there is the w command.
It is simply impossible, AND you cannot compute it multithreaded, only in a monotonic single-threaded run (this is the p command).
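
To see this property in action outside zpaqfranz, here is a minimal C++ sketch using zlib's crc32() and crc32_combine() (illustrative only, not zpaqfranz code): the CRC-32 of a whole buffer is reconstructed from the CRCs of its two fragments plus the length of the second fragment, without re-reading the data.

#include <cassert>
#include <zlib.h>

int main() {
    const unsigned char part1[] = "hello ";    // first fragment (6 bytes)
    const unsigned char part2[] = "world";     // second fragment (5 bytes)
    const unsigned char whole[] = "hello world";
    uLong seed = crc32(0L, Z_NULL, 0);         // zlib's initial CRC value
    uLong c1 = crc32(seed, part1, 6);
    uLong c2 = crc32(seed, part2, 5);
    uLong cw = crc32(seed, whole, 11);         // CRC of the concatenation
    // Combine the fragment CRCs: only c1, c2 and len(part2) are needed,
    // so fragments can be hashed in any order (or in parallel) and merged.
    assert(crc32_combine(c1, c2, 5) == cw);
    return 0;
}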

The net result is that SHA-1 collisions ARE detected by zpaqfranz (not corrected: detected).
"Hidden" changes inside the data will be detected too (e.g. archiving an in-use file that "someone" changes in the meantime, like a running VM).
This is because zpaq(franz) tries to archive practically everything it can, whether it is in use or not (after all, it is software designed for backup, unlike other compressors).
Obviously the probability is small, but it still exists.
Example:

P:\vm>zpaqfranz t sift.zpaq
zpaqfranz v58.4f-JIT-LBLAKE3,SFX64 v55.1,(2023-05-21)
sift.zpaq:
1 versions, 9 files, 206.441 frags, 919 blks, 5.274.750.028 bytes (4.91 GB)
To be checked 17.286.397.757 in 8 files (32 threads)
7.15 stage time      21.38 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
Checking            18.315 blocks with CRC-32 (16.713.549.149 not-0 bytes)
ERROR:  STORED CRC-32 2379B9B9 != DECOMPRESSED B67F2325 (ck 00008946) sift/SIFT-Workstation-disk1.vmdk

CRC-32 time           0.45s
Blocks      16.713.549.149 (      18.315)
Zeros             negative (       2.352) 0.156000 s
Total       14.439.568.073 speed 31.735.314.446/sec (29.56 GB/s)
ERRORS        : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
--------------------------------------------------------------------------------------
GOOD            : 00000007 of 00000008 (stored=decompressed)
WITH ERRORS

21.875 seconds (000:00:21) (with warnings)

7.15 stage time 21.38 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
This is the first stage, the zpaq 7.15-based test.
After that, zpaqfranz's own check kicks in (if any).
In the above example the archive is good (it is extractable), but the archived data is somewhat different from the source.

In this case everything is OK

P:\backup>zpaqfranz t www.zpaq
zpaqfranz v58.4f-JIT-LBLAKE3,SFX64 v55.1,(2023-05-21)
www.zpaq:
1 versions, 2.731 files, 106.338 frags, 523 blks, 6.966.672.085 bytes (6.49 GB)
To be checked 7.378.295.954 in 2.461 files (32 threads)
7.15 stage time      25.17 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
Checking             3.487 blocks with CRC-32 (7.378.213.268 not-0 bytes)
Block 00002K          6.11 GB
CRC-32 time           0.05s
Blocks       7.378.213.268 (       3.487)
Zeros               82.686 (           1) 0.000000 s
Total        7.378.295.954 speed 153.714.499.041/sec (143.16 GB/s)
GOOD            : 00002461 of 00002461 (stored=decompressed)
VERDICT         : OK                   (CRC-32 stored vs decompressed)

Computing the CRC-32 also slows down the archiving stage and makes the archive a bit bigger.
It is possible to turn it off and get a "straight" zpaq-style archive with -nochecksum.
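
For example (foo.zpaq and the foo folder are just placeholder names):

zpaqfranz a foo.zpaq foo -nochecksum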

Since data reliability is more important to me, I use it as the default.
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable
So, by default, you get THREE different tests:

  1. SHA-1 of the fragments
  2. CRC-32 of the whole file
  3. XXHASH64 of the whole file (or a different one, using for example -blake3 or whatever; see the sketch below)
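
As an illustration of check 3, a minimal C++ sketch that streams a file through XXH64 via the xxHash library's streaming API (an assumption chosen for illustration; zpaqfranz has its own internal integration):

#include <cstdio>
#include <xxhash.h>

// Compute a whole-file XXHASH64, reading in 64 KB chunks.
int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;
    XXH64_state_t* st = XXH64_createState();
    XXH64_reset(st, 0);                        // seed 0
    static char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        XXH64_update(st, buf, n);
    printf("XXHASH64: %016llX\n", (unsigned long long)XXH64_digest(st));
    XXH64_freeState(st);
    fclose(f);
    return 0;
}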

PS: -pakka changes only the output; it is an interface for the Windows GUI. Essentially it writes less information.

@fcorbelli (Owner)

PS: it is now 01:35; I will fix things and explain better later. Time for... bed :)

@fcorbelli (Owner)

You can find it here, listed as the very first difference:

https://github.com/fcorbelli/zpaqfranz/wiki/Diff-against-7.15:-add

@todd-a-jacobs (Author)

todd-a-jacobs commented Jun 22, 2023

BTW, I found that Apple Silicon (notably the M2 processor) seems to be hardware-optimized for SHA-256. When I ran the zpaqfranz benchmarks, even against a terabyte or two, SHA-256 performed at about the same speed as XXHASH3. It might make sense to check for hardware acceleration and use SHA-256 as the default instead of XXHASH3 when the performance is going to be roughly the same, since SHA-256 is cryptographically strong while the various XXHASH algorithms don't have any cryptographic properties at all.

Since I don't know how the benchmarks are done, this may not actually be representative of real-world speeds. Still, it's at least worth considering, since a number of other platforms also now include some form of AES hardware to speed up AES cipher operations.

@fcorbelli (Owner)

You can see "under the hood" with a

zpaqfranz b -debug

(...)
Free RAM seems 43.218.018.304
1838: new ecx 2130194955
1843: new ebx 563910569
SSSE3 :OK
SSE41 :OK
SHA   :OK
SHA1/2 seems supported by CPU

You need 3 "OK" to "automagically" get HW acceleration.
Rare, very rare, with Intel CPUs.
Sadly I do not like Macs very much (when I do use one, it is almost always from the terminal, just like a FreeBSD box :)
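
Those three "OK" flags correspond to standard x86 CPUID feature bits. A minimal sketch of the same detection (assuming recent GCC/Clang on x86; not the actual zpaqfranz code):

#include <cpuid.h>   // GCC/Clang x86 intrinsics header
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);          // leaf 1
    bool ssse3 = ecx & (1u << 9);                    // CPUID.1:ECX bit 9  = SSSE3
    bool sse41 = ecx & (1u << 19);                   // CPUID.1:ECX bit 19 = SSE4.1
    __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx); // leaf 7, subleaf 0
    bool sha = ebx & (1u << 29);                     // CPUID.7:EBX bit 29 = SHA extensions
    printf("SSSE3 :%s\nSSE41 :%s\nSHA   :%s\n",
           ssse3 ? "OK" : "NO", sse41 ? "OK" : "NO", sha ? "OK" : "NO");
    return 0;
}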

The benchmark is very, very crude, just a quick check to get some info on VPS CPUs.
I see your point, but the default is XXHASH (64 bit), not XXH3 (128 bit), so as not to "cook off" 32-bit CPUs (not every silicon is 64-bit).
With zpaqfranz you can choose... whatever you like (almost everywhere; some exceptions for md5).

@fcorbelli (Owner)

fcorbelli commented Jun 22, 2023

PS: this is a "real world" example from an Intel-based server, with proxmox + a FreeBSD VM, running on HDD:

root@franco:/home/mog1/copie # zpaqfranz versum "./*.zpaq" -checktxt
zpaqfranz v58.4q-JIT-L,HW SHA1/2,(2023-06-22)
franz:versum                                    | - command
franz:-checktxt
66265: Test MD5 hashes of .zpaq against _md5.txt
66136: Searching for jolly archive(s) in <<./*.zpaq>> for extension <<zpaq>>
66288: Bytes to be checked 250.899.885.678 (233.67 GB) in files 2
009% 00:29:23 (  22.19 GB) of ( 233.67 GB)          128.799.140/SeC

As you can see, the "real" bandwidth of the drive is about 128 MB/s; even a 10 GB/s hasher will gain nothing.

@ruptotus

[quotes @fcorbelli's full explanation above]

Hello,

I have a question about the "t" command. Is there some bug, or should I worry about my data?

My use case:

On Windows Server 2019 I have a DB2 database. I do a dump daily and store it in a zpaq file using just the "a" command without any switches. On that machine I use version "zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)".

When I test on that server, all is OK:

PS C:\instalki\zpaq715> .\zpaqfranz.exe t C:\KOPIE\backup.zpaq
zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)
C:/KOPIE/backup.zpaq:
15 versions, 15 files, 608.189 frags, 2.948 blks, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (24 threads)
7.15 stage time     353.92 no error detected (RAM ~385.55 MB), try CRC-32 (if any)
Checking           512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)

CRC-32 time          49.52s
Blocks     429.318.293.888 (     512.717)
Zeros       12.970.456.704 (      22.497) 6.616000 s
Total      442.288.750.592 speed 8.929.173.492/sec (8.32 GB/s)
GOOD            : 00000015 of 00000015 (stored=decompressed)
VERDICT         : OK                   (CRC-32 stored vs decompressed)

403.656 seconds (000:06:43) (all OK)

Then I transfer the archive to another computer (I use FileZilla's resume option to download only the new data).

The other computer is Windows 10 with zpaqfranz version "zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)".

And the test results are:

PS K:\dir> zpaqfranz.exe t .\backup.zpaq
zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)
./backup.zpaq:
15 versions, 15 files, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (4 threads)
7.15 stage time     202.50 no error detected (RAM ~64.26 MB), try CRC-32 (if any)
Checking           512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)
ERROR:  STORED CRC-32 77532235 != DECOMPRESSED C9E7F911 (ck 00019254) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230324223027.001
ERROR:  STORED CRC-32 E3A7A2B1 != DECOMPRESSED B976C123 (ck 00021111) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230331223032.001
ERROR:  STORED CRC-32 9AEAF621 != DECOMPRESSED 9849CEC6 (ck 00022809) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230407223028.001
ERROR:  STORED CRC-32 55371556 != DECOMPRESSED DFE0B566 (ck 00024730) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230416223027.001
ERROR:  STORED CRC-32 A09E6044 != DECOMPRESSED BB230EE2 (ck 00036567) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230506223028.001
ERROR:  STORED CRC-32 A41EC1BA != DECOMPRESSED 93C71B3A (ck 00039680) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230516223032.001
ERROR:  STORED CRC-32 49EA704B != DECOMPRESSED 33E65072 (ck 00040869) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230521223034.001
ERROR:  STORED CRC-32 DDA93190 != DECOMPRESSED 3DD784C3 (ck 00041314) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230526223034.001
ERROR:  STORED CRC-32 0B35FBDC != DECOMPRESSED 2DD981BB (ck 00040476) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230604223041.001
ERROR:  STORED CRC-32 CB5A36E3 != DECOMPRESSED 2480F3C8 (ck 00042183) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230609224237.001
ERROR:  STORED CRC-32 0082630B != DECOMPRESSED C2396AE7 (ck 00042683) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230615223050.001
ERROR:  STORED CRC-32 E80A5C1D != DECOMPRESSED 2F7E5B91 (ck 00042703) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230701223027.001
ERROR:  STORED CRC-32 AC3369F0 != DECOMPRESSED 47B5DD58 (ck 00043742) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230709223032.001
ERROR:  STORED CRC-32 3DFB85F8 != DECOMPRESSED D0F06B3B (ck 00010772) C:/kopie/DATABASE.0.DB2.DBPART000.20230320223031.001
ERROR:  STORED CRC-32 2984D169 != DECOMPRESSED 8A141274 (ck 00043824) Z:/kopie/DATABASE.0.DB2.DBPART000.20230616233029.001

CRC-32 time         104.83s
Blocks     429.318.293.888 (     512.717)
Zeros             negative (      72.961) 72.663000 s
Total      150.007.756.078 speed 1.430.975.742/sec (1.33 GB/s)
ERRORS        : 00000015 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WITH ERRORS

307.406 seconds (000:05:07) (with warnings)

But when I extract one file, for example "C:/KOPIE/DATABASE.0.DB2.DBPART000.20230709223032.001", and compute its CRC-32 manually with zpaqfranz, I get the good stored checksum:

PS C:\Users\user> zpaqfranz.exe sum D:\test\DATABASE.0.DB2.DBPART000.20230709223032.001 -crc32
zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)
franz:sum                                       1 - command
franz:-crc32
Getting CRC-32 ignoring .zfs and :$DATA

No multithread: Found (28.98 GB) => 31.112.585.216 bytes (28.98 GB) / 1 files in 0.015000
|CRC-32: AC3369F0 [     31.112.585.216]     |D:/test/DATABASE.0.DB2.DBPART000.20230709223032.001

214.296 seconds (000:03:34) (all OK)

The extracted hash AC3369F0 is equal to the stored hash from the "t" command.

I also calculated the SHA-256 checksum of the extracted dump file and of the original file on the server, and they are the same. So can I believe that the file stored in the zpaq archive is good?

PS1. While writing this comment I also downloaded the zpaqfranz exe version from the server, and its test is good:

PS C:\Users\user\Desktop> .\zpaqfranz.exe t K:\dir\backup.zpaq
zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)
K:/dir/backup.zpaq:
15 versions, 15 files, 608.189 frags, 2.948 blks, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (4 threads)
7.15 stage time     199.73 no error detected (RAM ~64.26 MB), try CRC-32 (if any)
Checking           512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)

CRC-32 time          32.25s
Blocks     429.318.293.888 (     512.717)
Zeros       12.970.456.704 (      22.497) 4.292000 s
Total      442.288.750.592 speed 13.707.154.386/sec (12.77 GB/s)
GOOD            : 00000015 of 00000015 (stored=decompressed)
VERDICT         : OK                   (CRC-32 stored vs decompressed)

232.032 seconds (000:03:52) (all OK)

So maybe there is some bug in the "t" command in newer versions? Or an incompatibility in the archive format?

PS2. Also, thank you for the fantastic job of continuing to develop zpaq. I used the original zpaq for years, and it was a nice find that someone continues the work ^_^

@fcorbelli (Owner)

It is a known bug, affecting file sizes longer than 10 decimal digits (i.e. 10 GB and up).
You can get the latest nightly build from http://www.francocorbelli.it/zpaqfranz with the bug fixed, plus the new -fasttxt magic computation of the full-archive CRC-32.
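
For readers wondering why 10 digits matters: 10 decimal digits top out at 9.999.999.999 bytes, so any file of 10 GB or more overflows a 10-character size field; the ~31 GB dumps above need 11 digits. A purely hypothetical C++ sketch of that class of bug (not the actual zpaqfranz code):

#include <cstdio>

int main() {
    char field[11];                                // 10 digits + NUL: too small
    unsigned long long size = 31112585216ULL;      // ~31 GB, 11 digits (as above)
    int needed = snprintf(field, sizeof field, "%llu", size);
    if (needed >= (int)sizeof field)               // snprintf reports the truncation
        printf("size %llu needs %d chars, field holds only 10\n", size, needed);
    return 0;
}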

@ruptotus

Thank you for the quick response and... for a fixed release already ^_^
