Feedback: Deduplication galore/granularity suggestion #886

Closed
Sanmayce opened this issue Oct 10, 2017 · 6 comments

@Sanmayce

Hi Yann, hi Stella,
very nice to see this feature, it is a must with these mountains of text coming in unceasingly.

My suggestion/question is about the granularity of the deduplication currently implemented.
I haven't looked into the source code of any deduplicator, but I remember my friend Lasse using, some 6 years ago, something like 8 KB in his eXdupe.
I asked Christian about the granularity of RAZOR; he said:

256b: 2048M range
512b: 4096M range
...
64K: 512G range 

So, what size are the current MinMatches used for, say, a 512 MB window (29 bits) in Stella's etude?

My Zennish textual micro-deduplicator uses a 128 GB window, i.e. (5+1)*8-8-3 = 37 bits, with these "long" matches:

18:(5+1)=3
32:(5+1)=5.3
48:(5+1)=8
80:(5+1)=13.3
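
To spell out the notation: each "long" match is one (5+1) = 6-byte token, so the ratio is simply the match length over the token size, while the offset gets the remaining bits:

(5+1)*8 - 8 - 3 = 37 offset bits  ->  2^37 bytes = 128 GB reach
18/6 = 3,  32/6 = 5.3,  48/6 = 8,  80/6 = 13.3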

And my suggestion: my experience with e-books dictates that deduplication be done at the tiniest orders AS WELL, somewhere between 40 and 80 bytes (talking about 1 byte per character).
Simply because a huge textual collection usually contains many different editions (with their respective formatting/wrapping); for example, one sentence could be the same across 7-8 editions of 'Don Quixote', yet its actual appearance may be on one physical line or two. I mean, the [CR]LF delimiter breaks the long matches (see the wrapped example below). So, my estimate for order-48 deduplication is roughly 48:6 = 8:1 compression ratio, whereas 'bsc' cannot reach a full 6:1 for my main corpus 'GAMERA' (336 GB strong - pure English text); it suffers from not being able to deduplicate.
My guess is that the 40..80-byte range across 128 MB..336 GB windows is not implemented yet - am I wrong?
If we glimpse at the near future, we can already see 1 TB of RAM, i.e. 8x128 GB kits, on AMD Threadripper motherboards. My drift: imagine the specific task of full-text traversing those 336 GB IN-MEMORY! Which compressor of today can see "long" matches that far back!? NONE!
My wish is that one day, say 5 years from now, monstrous hardware will meet a monstrous decompressor. Simple.
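
For illustration, here is the famous opening line wrapped at two different (made-up) widths, as in two different editions:

Edition A:  In a village of La Mancha, the name of which I have no desire to[CR][LF]
            call to mind, there lived not long since one of those gentlemen...
Edition B:  In a village of La Mancha, the name of which I have no desire to call[CR][LF]
            to mind, there lived not long since one of those gentlemen that keep...

The runs that survive the shifted [CR]LF positions are only about 60-65 bytes each - exactly the 40..80-byte order suggested above, and invisible to any block-sized deduplicator.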

Nakamichi ‘Dragoneye’ targets textual data - e-books of any caliber, especially anthologies/encyclopedias and big/huge/tera corpora of texts - yet it needs many years to compress, and I see it as already outdated with its 128 GB limit 😿 🤣

To see that even 22 volumes, 105 MB of pure text, are not that easy to compress:

D:\Ryuugan+_testdata>zstd-v1.3.2-win64.exe -b22 --ultra --long "Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar"
22#_volumes)_TXT.tar : 107784192 ->  28147968 (3.829),   1.1 MB/s , 485.5 MB/s

D:\Ryuugan+_testdata>zstd-v1.3.2-win64.exe -b22 --ultra "Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar"
22#_volumes)_TXT.tar : 107784192 ->  28049674 (3.843),   1.0 MB/s , 484.6 MB/s

https://twitter.com/Sanmayce/status/917009706761236485

[attached images: ryuugan_booklet_1, ryuugan_booklet_2]

The Linux kernels benchmark is super; here is my quick test with 1 GB of source code, with an interesting histogram (unlike English texts, it has abundant 8+ byte long matches across the entire pool):

[image: fv histogram]

---------------------------------------------
| Compressor                  |        Size |
|-------------------------------------------|
| RAZOR 512M                  |  91,797,748 |
| RAZOR 32M                   |  95,947,043 |
| GLZA Fast                   | 109,399,317 |
| ZSTD -b22 --ultra           | 114,459,872 |
| ZSTD -b22 --ultra --long=30 | 116,795,556 |
| ZSTD -b22 --ultra --long    | 119,836,913 |
| libdeflate 12               | 186,919,936 |
| lizard 39                   | 227,347,207 |
---------------------------------------------

To me, Lizard 39 (being LZ4 strengthened parse-wise and entropyfied) is kind of a micro Zstd, no? Anyway, its 2833 MB/s is supercool: 2x weaker but 3x faster than "vanilla" Zstd.

D:\Ryuugan+_testdata>dir

09/22/2017  05:16 AM     1,028,290,560 windows_nt_4_source_code.tar
10/09/2017  11:35 PM           932,533 zstd-v1.3.2-win64.exe
10/10/2017  02:53 AM         1,100,346 zstd-v1.3.2-win64.zip

D:\Ryuugan+_testdata>zstd-v1.3.2-win64.exe -b22 --ultra windows_nt_4_source_code.tar
22#4_source_code.tar :1028290560 -> 114459872 (8.984),   1.5 MB/s , 945.1 MB/s

D:\Ryuugan+_testdata>zstd-v1.3.2-win64.exe -b22 --ultra --long windows_nt_4_source_code.tar
22#4_source_code.tar :1028290560 -> 119836913 (8.581),   2.3 MB/s , 944.5 MB/s

D:\Ryuugan+_testdata>zstd-v1.3.2-win64.exe -b22 --ultra --long=30 windows_nt_4_source_code.tar
22#4_source_code.tar :1028290560 -> 116795556 (8.804),   2.5 MB/s , 956.2 MB/s

D:\Ryuugan+_testdata>rz a -d 512M windows_nt_4_source_code.tar.rz windows_nt_4_source_code.tar

 *** RAZOR Archiver 1.00 (2017-09-08) - DEMO/TEST version ***
 *** (c) Christian Martelock (christian.martelock@web.de) ***

 Scanning d:\ryuugan+_testdata\windows_nt_4_source_code.tar
 -> Overwrite windows_nt_4_source_code.tar.rz (Yes/No/Always/nEver)? y
 Found 0 dirs, 1 files, 1028290560 bytes.

 Creating archive windows_nt_4_source_code.tar.rz
 Window : 524288K (4096M..1024G)
 Header : 62
 Size   : 91797748

 Archive ok. Added 0 dirs, 1 files, 1028290560 bytes.
 CPU time = 2388.984s / wall time = 1967.345s

D:\Ryuugan+_testdata>rz a -d 32M windows_nt_4_source_code.tar.rz windows_nt_4_source_code.tar

 *** RAZOR Archiver 1.00 (2017-09-08) - DEMO/TEST version ***
 *** (c) Christian Martelock (christian.martelock@web.de) ***

 Scanning d:\ryuugan+_testdata\windows_nt_4_source_code.tar
 -> Overwrite windows_nt_4_source_code.tar.rz (Yes/No/Always/nEver)? y
 Found 0 dirs, 1 files, 1028290560 bytes.

 Creating archive windows_nt_4_source_code.tar.rz
 Window : 32768K (256M..64G)
 Header : 62
 Size   : 95947043

 Archive ok. Added 0 dirs, 1 files, 1028290560 bytes.
 CPU time = 1929.125s / wall time = 1617.300s

D:\Ryuugan+_testdata>GLZA.exe c -f windows_nt_4_source_code.tar windows_nt_4_source_code.tar.glza
Converting textual data
cap encoded: 1, UTF8 compliant 0
Found 1115325754 symbols
PASS 367: grammar size 76141322, 6730716 grammar rules
76141321 syms, dict. size 6730716, 16.9828 bits/sym, o0e 161636287.65 bytes
Eliminated 73717 single appearance grammar rules
75993887 syms, dict. size 6656999, 16.9855 bits/sym, o0e 161349437.74 bytes
Finished embedding grammar rules
Final grammar size: 69336888 (232 terminals + 6656999 rule symbols + 62679657 repeats)
16.2475 bits/symbol, plus 6656999 length codes 2.1720 bits/code, o0e 142625851.68 bytes
Parsed 41796728 level 0 symbols
use_mtf 1, mcl 22 mrcl 21
Compressed 1028290560 bytes -> 109399317 bytes (0.8511 bpB) in 2378.009 seconds.

D:\Ryuugan+_testdata>"turbobenchs_Official_v17.04_-_build_07_Apr_2017.exe" windows_nt_4_source_code.tar -elibdeflate,12/lizard,39 -g -I1 -J31 -k1 -B2G
   186919936    18.2       4.87     834.85   libdeflate 12    windows_nt_4_source_code.tar
   227347207    22.1       1.94    2833.00   lizard 39        windows_nt_4_source_code.tar

D:\Ryuugan+_testdata>

Test machine: 'Compressionette' - i5-7200U, DDR4.

@Cyan4973 (Contributor)

Thanks for the feedback @Sanmayce.
For your information, there are a few advanced parameters you can play with to test some of your ideas.
They are documented in the man page (man zstd on a Unix system).
One of them makes it possible to change the minimum match size (ldmSearchLength, which is 64 by default); an example invocation follows the excerpt below.

  • --long[=#]:
    enables long distance matching with # windowLog (27 by default).
    It sets the window size (windowLog) and memory usage for both the
    compressor and decompressor.
    This setting is designed to improve the compression ratio for files with
    long matches at a large distance.

    Note: When windowLog is >27, --long=windowLog or
    --memory=windowSize needs to be passed to the decompressor.

--zstd[=options]:

  • ldmHashLog=ldmhlog, ldmhlog=ldmhlog:
    Specify the maximum size for a hash table used for long distance matching.

    This option is ignored unless long distance matching is enabled.

    Bigger hash tables usually improve compression ratio at the expense of more
    memory during compression and a decrease in compression speed.

    The minimum ldmhlog is 6 and the maximum is 26 (default: 20).

  • ldmSearchLength=ldmslen, ldmslen=ldmslen:
    Specify the minimum searched length of a match for long distance matching.

    This option is ignored unless long distance matching is enabled.

    Larger/very small values usually decrease compression ratio.

    The minimum ldmslen is 4 and the maximum is 4096 (default: 64).

  • ldmBucketSizeLog=ldmblog, ldmblog=ldmblog:
    Specify the size of each bucket for the hash table used for long distance
    matching.

    This option is ignored unless long distance matching is enabled.

    Larger bucket sizes improve collision resolution but decrease compression
    speed.

    The minimum ldmblog is 0 and the maximum is 8 (default: 3).

  • ldmHashEveryLog=ldmhevery, ldmhevery=ldmhevery:
    Specify the frequency of inserting entries into the long distance matching
    hash table.

    This option is ignored unless long distance matching is enabled.

    Larger values will improve compression speed. Deviating far from the
    default value will likely result in a decrease in compression ratio.

    The default value is wlog - ldmhlog.
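
For example (untested command lines, just combining the flags documented above), compressing with a 1 GB window (windowLog=30) and a 40-byte minimum match for long distance matching, then decompressing (windowLog > 27 requires --long or --memory on the decompression side as well):

zstd --ultra -22 --long=30 --zstd=ldmSearchLength=40 "Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar"
zstd -d --long=30 "Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar.zst"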

@Sanmayce (Author)

Wow, thank you for the detailed help, mutsi!
> The minimum ldmslen is 4 and the maximum is 4096 (default: 64).
Above-and-Beyond my expectations, a very nice choice not to butcher the fine granularities!

Since my inclination is to take the path of least resistance, in future revisions my intent is to add a 1-byte tag that REPEATs the previous/last match-copy without giving any offset (if the two matches are adjacent), thus reaching for 18:1, 32:1, 48:1 and 80:1 ratios. Some primitive ROLZ, I guess.
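
A rough sketch of the decoder side of such a 1-byte REPEAT tag (hypothetical tag value and names, not the actual Nakamichi code):

/* Hypothetical sketch only: a 1-byte REPEAT tag replays the previous match
   with the same offset and length, no offset field, so an adjacent
   18/32/48/80-byte match costs 1 byte, i.e. 18:1 .. 80:1. */
#include <stddef.h>

enum { TAG_REPEAT = 0xFF };                     /* made-up tag value */

typedef struct { size_t offset, len; } LastMatch;

static size_t CopyMatch(unsigned char *dst, size_t dpos, LastMatch m)
{
    for (size_t i = 0; i < m.len; i++)          /* byte-wise so overlapping copies work */
        dst[dpos + i] = dst[dpos - m.offset + i];
    return dpos + m.len;
}

static size_t DecodeToken(const unsigned char *src, size_t *spos,
                          unsigned char *dst, size_t dpos, LastMatch *last)
{
    if (src[*spos] == TAG_REPEAT) {             /* 1 byte, no offset given */
        *spos += 1;
        return CopyMatch(dst, dpos, *last);     /* replay the previous copy */
    }
    /* ... otherwise decode a normal (5+1)-byte token here and
           update last->offset / last->len accordingly ... */
    return dpos;
}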

@Sanmayce (Author)

I wanted to put side by side some 1 GB block performers...

|-------------------------------------------------------------------------------------------------------------------------------|
| Compressor, options                                      |        Size | Process Time | Memory Footprint | Decompression Rate |
|-------------------------------------------------------------------------------------------------------------------------------|
| RAZOR a -d 1023M                                         |  91,510,989 |        2909s |          7632 MB |               N.A. | 
| RAZOR a -d 512M                                          |  91,797,748 |         N.A. |             N.A. |               N.A. |
| RAZOR a -d 32M                                           |  95,947,043 |         N.A. |             N.A. |               N.A. |
| 7za a -t7z -mx9 -md=30                                   | 105,092,126 |         671s |          9882 MB |               N.A. |
| xz -z -k -f -9 -e --lzma2=dict=1024MiB --threads=1       | 105,105,340 |         596s |          9873 MB |               N.A. |
| GLZA c                                                   | 107,706,486 |      167999s |          8715 MB |               N.A. |
| lzturbo 39                                               | 108,959,997 |    0.72 MB/s |             N.A. |        947.54 MB/s |
| GLZA c -f                                                | 109,399,317 |         N.A. |             N.A. |               N.A. |
| ZSTD --ultra -22 --zstd=wlog=31,clog=30,hlog=30,slog=26  | 109,555,123 |        1061s |          9195 MB |               N.A. |
| ZSTD --ultra -22 --zstd=wlog=31,clog=30,hlog=30,slen=3   | 109,577,289 |         971s |          9195 MB |               N.A. |
| ZSTD --ultra -22 --zstd=wlog=31,clog=30,hlog=30          | 109,577,289 |         970s |          9195 MB |               N.A. |
| ZSTD --ultra -22 --zstd=wlog=31,clog=30,hlog=30,tlen=128 | 109,884,117 |         651s |          9195 MB |               N.A. |
| oodle 89                                                 | 113,450,433 |    0.36 MB/s |             N.A. |        896.16 MB/s |
| zstd 22                                                  | 114,437,182 |    1.30 MB/s |             N.A. |        705.78 MB/s |
| ZSTD -b22 --ultra                                        | 114,459,872 |         N.A. |             N.A. |               N.A. |
| ZSTD -b22 --ultra --long=30                              | 116,795,556 |         N.A. |             N.A. |               N.A. |
| BSC e -b1024 -m6 -cp -Tt                                 | 117,657,230 |          50s |          3853 MB |               N.A. |
| ZSTD -b22 --ultra --long                                 | 119,836,913 |         N.A. |             N.A. |               N.A. |
| lzturbo 59                                               | 121,391,408 |    9.02 MB/s |             N.A. |         17.52 MB/s |
| brotli 11                                                | 121,921,545 |    0.45 MB/s |             N.A. |        449.15 MB/s |
| lzturbo 29                                               | 131,654,103 |    0.77 MB/s |             N.A. |       1326.42 MB/s |
| 7za a -tgzip -mx9                                        | 186,336,282 |        1063s |             7 MB |               N.A. |
| libdeflate 12                                            | 186,919,936 |    4.05 MB/s |             N.A. |        632.95 MB/s |
| lizard 39                                                | 227,347,207 |         N.A. |             N.A. |               N.A. |
|-------------------------------------------------------------------------------------------------------------------------------|

Test machine: laptop, i5-2430M, 2 cores/4 threads, 16 GB DDR3

Performers:

  • rz_1.01.exe
  • GLZA_v0.9.2.exe
  • zstd-v1.3.2-win64.exe
  • bsc_v3.1.0_x64.exe
  • 7za_v16.04_x64.exe
  • xz_v5.2.3_x64.exe
  • turbobenchs_Official_v17.04_-_build_07_Apr_2017.exe

Depending on what one wants, there are several favorite performers; some "features" to consider:

  • (1) Speed of Decompression;
  • (2) Compression Strength;
  • (3) Speed of Compression;
  • (4) Memory usage: 8 GB of RAM is still somewhat of a limitation, though not for heavy users;
  • (5) Multi-threading, i.e. whether more threads are beneficial;
  • (6) Open vs. closed source;
  • (7) Portability across *NIX/ARM/PC.

Pavlov's 7z, Grebnov's BSC and Yann's Zstd are wonderful in the above departments.
Oodle 'Kraken' and LzTurbo 39 are awesome: 1 GB of source code decompressed in about a second!

@ghost commented Oct 14, 2017

Hello,

thank you for the nice set of results for the different compressors. Could you also run some tests with 7-Zip ZS?

If decompression speed is important, there are some nice levels in Lizard. You could also test brotli and zstd in this implementation. Would be cool to have some comparison :-)

And since you're testing on Windows, maybe wtime would also help.

@Sanmayce (Author)

Hi,
the thanks have to be redirected to an Australian fellow who shared the test file.

Here my wish was to test Zstd's ability to parse/deduplicate the whole 1 GB block in a single thread and compare it with other performers. To avoid flooding this question thread (maybe Yann has to close it), we may proceed on your page.

> Could you also run some tests with 7-Zip ZS?
Sure, the more the merrier.

> Would be cool to have some comparison :-)
Always. Sadly, my two laptops are not enough to present a heavy-duty picture: lack of threads, lack of 16 GB RAM (only 8 GB). In essence, you, I and other benchmarkers could come up with a console package (similar to my current smashshop package, see the Judaica package download link in the PDF) running all interesting compressors with all interesting options via a batch file; when someone comes to the party and says "where is this compressor, why are these options not in use" and so on, the package could easily be enriched - once the most popular performers are in there, meaning some saturation has to be reached first. So, please provide the exact command lines with 7-Zip ZS that you want to see; I will add them and upload a new package featuring WinNT4 or book_serie_SPETSNAZ_803_novels_(Russian).tar - the latter is quite cool since it is a natural-language corpus, Russian on top of that, and concerning their Special Military Forces literature - a hat-trick.

My wish is to present several sets (DNA, XML, TXT, HTML, C source) in order to test how well the parsing goes beyond the 64KB/256KB/16MB/32MB/128MB marks, with files such as these:

0,560,714,120 Oxford_English_Dictionary_2nd_Edition_Version_4_(En-En).dsl
0,648,260,096 gcc-6.3.0_(96398_Files_5502_Folders).tar
0,681,378,816 webdevdata.org_8000_home_pages_from_the_top_10000_most_popular_web_sites.tar
0,681,979,392 book_serie_SPETSNAZ_803_novels_(Russian).tar
0,686,991,360 www.kernel.org_linux-4.8.4.tar
0,975,021,056 www.ncbi.nlm.nih.gov_Dragonfly_(Ladona_fulva)_whole_genome_shotgun.tar
1,000,000,000 enwik9
1,036,155,727 Urban_Dictionary_2015_(Eng-Eng)_utf8.dsl
1,198,966,784 Sacred_Texts_7_(97830_htm_files).tar

Putting aside all the criteria which are interesting to many users, my take is that some test datasets are paragons, that is, indicative to the utmost - one such "file" is the OED, i.e. Oxford_English_Dictionary_2nd_Edition, 534 MB long. You see, years will go by, but saying that decompressor X "opens" the OED at Y GB/s is like a picture instead of a thousand words. Never mind how many threads, what RAM, what OS, what clocks, what compiler... just Y GB/s.
Of course, giving an exhaustive picture (as you almost did on your 7-Zip ZS page - the tables/graphs are cool) is an old wish of mine, but lacking the needed hardware stops me from going for it. Testing multi-threaded performance on my weak 2-core laptops is unacceptable - e.g. 16 threads on a Ryzen 1800 or 16 cores on a Threadripper is what will get the job done; butchering this part of the multi-faceted review/showdown would be dropping the ball.

@Sanmayce (Author)

Yann, I wanted to see how this primitive ROLZ-like thing plays out, so I applied it to 4 paths within the decompressor.
For what it's worth, the source code and the compiles of the latest and strongest Nakamichi are downloadable at IDZ (Intel Developer Zone), Intel's forum:
https://software.intel.com/sites/default/files/managed/04/36/Nakamichi_Ryuugan-ditto-1TB_%28source_compile_32bit_64bit%29.zip

Sizewise/MatchLenWise priority used in 'Ryuugan-ditto-1TB':

//#01 704:(5+1)+1+1+1+1+1+1+1+1+1+1=44     (1TB)
//#02 320:(5+1)+1+1+1+1=            32     (1TB)
//#03 256:(5+1)+1+1+1=              28.4   (1TB)
//#04 192:(5+1)+1+1=                24     (1TB)
//#05  90:2+1+1=                    22.5  (512B) 
//#06  44:2+1=                      22    (512B)
//#07 120:3+1+1+1=                  20   (128KB)
//#08  60:2+1=                      20    (512B) 
//#09  96:2+1+1+1=                  19.2   (4KB)
//#10 128:(5+1)+1=                  18.2   (1TB)
//#11  90:3+1+1=                    18   (128KB)
//#12  72:2+1+1=                    18     (4KB)
//#13  96:3+1+1+1=                  16     (1MB)
//#14  48:2+1=                      16     (4KB)
//#15  60:3+1=                      15   (128KB)
//#16  30:2=                        15    (512B)
//#17  72:3+1+1=                    14.4   (1MB)
//#18  56:(2+1)+1=                  14     (8KB)
//#19  96:4+1+1+1=                  13.7 (256MB)
//#20  72:4+1+1=                    12   (256MB)
//#21  48:3+1=                      12     (1MB)
//#22  24:2=                        12     (4KB)
//#23  12:1=                        12     (16B)
//#24  56:(3+1)+1=                  11.2   (2MB)
//#25  44:3+1=                      11   (128KB)
//#26  22:2=                        11    (512B)
//#27  64:(5+1)=                    10.6   (1TB)
//#28  30:3=                        10   (128KB)
//#29  48:4+1=                       9.6 (256MB)
//#30  56:(4+1)+1=                   9.3 (512MB)
//#31  28:(2+1)=                     9.3   (8KB)
//#32  36:(2+1)+1=                   9     (8KB)
//#33  56:(5+1)+1=                   8   (128GB)
//#34  24:3=                         8     (1MB)
//#35  16:2=                         8     (4KB)
//#36   8:1=                         8     (16B)
//#37  22:3=                         7.3 (128KB)
//#38  36:(3+1)+1=                   7.2   (2MB)
//#39  28:(3+1)=                     7     (2MB)
//#40  14:2=                         7    (512B)
//#41  36:(4+1)+1=                   6   (512MB)
//#42  24:4=                         6   (256MB)
//#43  18:(2+1)=                     6     (8KB)
//#44  12:2=                         6     (4KB)
//#45  28:(4+1)=                     5.6 (512MB)
//#46  16:3=                         5.3   (1MB)
//#47  36:(5+1)+1=                   5.1 (128GB)
//#48  28:(5+1)=                     4.6 (128GB)
//#49  14:3=                         4.6 (128KB)
//#50  18:(3+1)=                     4.5   (2MB)
//#51  16:4=                         4    (16MB)
//#52  12:3=                         4     (1MB)
//#53   8:2=                         4     (4KB)
//#54   4:1=                         4     (16B)
//#55  18:(4+1)=                     3.6 (512MB)
//#56  14:4=                         3.5  (16MB)
//#57  18:(5+1)=                     3   (128GB)
//#58  12:4=                         3    (16MB)
//#59   6:2=                         3    (512B)
//#60   8:3=                         2.6   (1MB)
//#61  10:4=                         2.5  (16MB)
//#62   8:4=                         2    (16MB)
//#63   6:3=                         2   (128KB)
//#64   4:2=                         2     (4KB)
//#65   6:4=                         1.5  (16MB)
//#66   4:3=                         1.3   (1MB)

The compression speed is one abysmal atrocity (I love it for its dragonicicity): decades on end are needed. Someday I hope to write the compressor using only external RAM - multi-terabytes housing the order-3 B-trees for each MatchLength across the entire file, not just some window - it will be an SSD frenzy.
https://twitter.com/Sanmayce/status/921325676036280320
