Feedback: Deduplication galore/granularity suggestion #886
Thanks for the feedback @Sanmayce.
--zstd[=options]:
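(For reference, the CLI's `--zstd[=options]` knobs correspond to libzstd's advanced compression parameters. Below is a minimal, illustrative C sketch assuming libzstd v1.4+ and its `ZSTD_CCtx_setParameter` API; the specific values are examples only, not recommendations from this thread.)

```c
/* Minimal sketch: roughly what --zstd=windowLog=27,minMatch=3 would request
 * on the CLI, expressed via libzstd's advanced API (requires libzstd >= 1.4.0,
 * link with -lzstd). Values are illustrative only. */
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    const char *src = "example input that would normally be a large buffer";
    size_t srcSize = strlen(src);
    char dst[512];                              /* ample for this tiny input */

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 27);                 /* 128 MB window */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_minMatch, 3);                   /* smallest searched match */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1); /* find far-away matches */

    size_t csize = ZSTD_compress2(cctx, dst, sizeof dst, src, srcSize);
    if (ZSTD_isError(csize))
        printf("error: %s\n", ZSTD_getErrorName(csize));
    else
        printf("%zu -> %zu bytes\n", srcSize, csize);

    ZSTD_freeCCtx(cctx);
    return 0;
}
```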
Wow, thank you for the detailed help, mutsi! Since my inclination is to take the path of least resistance, in future revisions my intent is to add a 1-byte tag that REPEATs the previous/last match-copy without giving any offset (if the two matches are adjacent), thus reaching for 18:1, 32:1, 48:1 and 80:1 ratios. Some primitive ROLZ, I guess.
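(A minimal toy sketch of that REPEAT idea. The token layout below is invented purely for illustration and is not Nakamichi's or Zstd's actual format: the decoder remembers the last copy's offset and length, so a single-byte REPEAT tag can re-issue the copy with no offset field.)

```c
/* Toy decoder with a 1-byte REPEAT token that reuses the last match
 * (the "primitive ROLZ" idea above). Hypothetical layout:
 *   0x00 len bytes...        -> literal run of `len` bytes
 *   0x01 len off_lo off_hi   -> copy `len` bytes from `off` bytes back
 *   0x02                     -> REPEAT: same offset and length as before
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static size_t toy_decode(const uint8_t *in, size_t inLen, uint8_t *out)
{
    size_t ip = 0, op = 0;
    size_t lastOff = 0, lastLen = 0;           /* remembered between tokens */
    while (ip < inLen) {
        uint8_t tag = in[ip++];
        if (tag == 0x00) {                     /* literal run */
            size_t len = in[ip++];
            memcpy(out + op, in + ip, len);
            ip += len; op += len;
        } else if (tag == 0x01) {              /* match with explicit offset */
            lastLen = in[ip++];
            lastOff = in[ip] | (in[ip + 1] << 8);
            ip += 2;
            for (size_t i = 0; i < lastLen; i++)   /* overlap-safe byte copy */
                out[op + i] = out[op + i - lastOff];
            op += lastLen;
        } else {                               /* 0x02: REPEAT, no offset/length fields */
            for (size_t i = 0; i < lastLen; i++)
                out[op + i] = out[op + i - lastOff];
            op += lastLen;
        }
    }
    return op;
}

int main(void)
{
    /* "abcdef" literals, copy 6 bytes from 6 back, then REPEAT the same copy:
     * the second copy costs 1 byte instead of 4. */
    const uint8_t stream[] = { 0x00, 6, 'a','b','c','d','e','f',
                               0x01, 6, 6, 0,
                               0x02 };
    uint8_t out[64];
    size_t n = toy_decode(stream, sizeof stream, out);
    printf("%.*s\n", (int)n, out);             /* prints: abcdefabcdefabcdef */
    return 0;
}
```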
I wanted to put some 1 GB-block performers side by side...
Test machine: laptop i5-2430M, 2 cores/4 threads, 16 GB DDR3. Performers:
Depending on what one wants, there are several favorite performers and some "features" to consider:
Pavlov's 7z, Grebnov's BSC and Yann's Zstd are wonderful in the above departments.
Hello, thank you for the nice dataset of different compressors. Could you make some tests with 7-Zip ZS as well? If decompression speed is important, there are some nice levels in Lizard. You could also test brotli and zstd in this implementation. It would be cool to have some comparison :-) And since you're testing on Windows, maybe wtime would help too.
Hi, here my wish was to test Zstd's ability to parse/deduplicate the whole 1 GB block in a single thread and compare it with other performers. To avoid flooding this question thread (maybe Yann has to close it), we may proceed on your page.
> Could you make some tests with 7-Zip ZS also?
> Would be cool to have some comparison :-)
My wish is to present several sets (DNA, XML, TXT, HTML, C source) in order to test how well the parsing goes beyond the 64KB/256KB/16MB/32MB/128MB mark, such as these:
Putting aside all the criteria which are interesting to many users, my take is that some test datasets are paragons, that is, indicative to the utmost - one such "file" is OED, i.e. Oxford_English_Dictionary_2nd_Edition, 534 MB long. You see, years will go by, but saying that decompressor X "opens" OED at Y GB/s is like a picture instead of a thousand words. No matter how many threads, what RAM, what OS, what clocks, what compiler, ... just Y GB/s.
Yann, I wanted to see how this primitive ROLZ-like thing plays out, so I applied it to 4 paths within the decompressor. Size-wise/MatchLen-wise priority used in 'Ryuugan-ditto-1TB':
The compression speed is one abysmal atrocity (I love it for its dragonicicity) - decades on end would be needed. Someday I hope to write the compressor using only external RAM - multiple terabytes housing order-3 B-trees for each MatchLength across the entire file, not just some window - it will be an SSD frenzy.
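(A toy, in-RAM stand-in for that "ordered index per MatchLength over the whole file" idea: for one fixed length, sort every position of the input by its leading bytes, then binary-search that index to ask whether the same string has appeared anywhere earlier, with no sliding window. Everything below - the length, the sorted-array index instead of an external B-tree, the sample text - is illustrative only, not Ryuugan's actual design.)

```c
/* Whole-file match lookup via an ordered position index (B-tree stand-in). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define L 16                      /* the one MatchLength this index serves */

static const unsigned char *g_buf;

static int cmp_key(const void *a, const void *b)
{
    return memcmp(g_buf + *(const size_t *)a, g_buf + *(const size_t *)b, L);
}

int main(void)
{
    static const unsigned char text[] =
        "to be, or not to be, that is the question; to be, or not to be...";
    size_t n = sizeof text - 1;
    g_buf = text;

    size_t nkeys = n - L + 1;
    size_t *idx = malloc(nkeys * sizeof *idx);
    for (size_t i = 0; i < nkeys; i++) idx[i] = i;
    qsort(idx, nkeys, sizeof *idx, cmp_key);      /* ordered index, no window */

    /* For each position, find the earliest prior occurrence of its L bytes. */
    for (size_t pos = 0; pos < nkeys; pos++) {
        size_t *hit = bsearch(&pos, idx, nkeys, sizeof *idx, cmp_key);
        while (hit > idx && cmp_key(hit - 1, &pos) == 0) hit--;  /* start of run */
        if (*hit < pos)
            printf("pos %2zu: \"%.*s\" <- repeats pos %zu\n",
                   pos, L, text + pos, *hit);
    }
    free(idx);
    return 0;
}
```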
Hi Yann, hi Stella,
very nice to see this feature, it is a must with these mountains of texts coming unceasingly.
My suggestion/question is about the granularity of deduplication currently implemented.
I haven't looked into the source code of any deduplicator, but I remember my friend Lasse using, some 6 years ago, around 8 KB granularity in his eXdupe.
I asked Christian about the granularity of RAZOR; he said:
So, what are the current MinMatch sizes used for, say, a 512 MB (29-bit) window in Stella's etude?
My Zennish textual microdeduplicator uses a 128 GB window, i.e. (5+1)*8-8-3 = 37 bits, with these "long" matches:
And my suggestion: my experience with e-books dictates that deduplication be done at the tiniest orders AS WELL, somewhere between 40 and 80 bytes (talking about 1 byte per character).
Simply because a huge textual collection usually contains many different editions (with their respective formatting/wrapping): for example, one sentence could be the same across 7-8 editions of 'Don Quixote', yet its actual appearance may fall on one or two physical lines. I mean that the [CR]LF delimiter breaks the long matches. So, my estimate for order-48 deduplication is roughly a 48:6 = 8:1 compression ratio, whereas 'bsc' cannot reach a full 6:1 on my main corpus 'GAMERA' (336 GB strong - pure English text); it suffers from not being able to deduplicate.
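(To make that arithmetic concrete, a back-of-the-envelope sketch of a hypothetical (5+1)-byte match token; the field split is illustrative only, not the actual Nakamichi layout.)

```c
/* Back-of-the-envelope for the figures above; the token split is
 * hypothetical, used only to show where 37 bits and 8:1 come from. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* A (5+1)-byte token: 48 bits total; spend 8 on the tag and 3 on a
     * length selector, leaving 48 - 8 - 3 = 37 bits for the offset. */
    int tokenBytes = 5 + 1;
    int offsetBits = tokenBytes * 8 - 8 - 3;
    uint64_t windowBytes = 1ULL << offsetBits;           /* reachable distance */
    printf("offset bits : %d\n", offsetBits);            /* 37 */
    printf("window      : %llu GB\n",
           (unsigned long long)(windowBytes >> 30));     /* 128 */

    /* Order-48 dedup: a 48-byte repeated run replaced by one 6-byte token. */
    int matchLen = 48;
    printf("ratio       : %d:%d = %d:1\n",
           matchLen, tokenBytes, matchLen / tokenBytes); /* 48:6 = 8:1 */
    return 0;
}
```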
My naive guess is that the 40..80 byte range across 128 MB..336 GB windows is not implemented yet - am I wrong?
If we glimpse at the near future, we can already see 1 TB of RAM, i.e. 8x128 GB kits, on AMD Threadripper motherboards. My drift: imagine the specific task of full-text traversing those 336 GB IN-MEMORY! Which compressor of today can see "long" matches that far back!? NONE!
My wish is that one day, say 5 years from now, monstrous hardware will meet a monstrous decompressor - simple.
Nakamichi ‘Dragoneye’ targets textual data - ebooks of any caliber, especially anthologies/encyclopedias and big/huge/tera corpora of texts - yet it needs many years to compress, and I see it as already outdated with its 128 GB limit 😿 🤣
To see that even 22 volumes of pure text, 105 MB, are not that easy to compress:
https://twitter.com/Sanmayce/status/917009706761236485
The Linux kernels benchmark is super; here is my quick test with 1 GB of source code, with an interesting histogram (unlike English texts, there are abundant 8+ byte matches across the entire pool):
To me, Lizard 39 (being LZ4 strengthened parse-wise and entropyfied) is kind of a micro Zstd, no? Anyway, its 2833 MB/s is supercool - 2x weaker but 3x faster than "vanilla" Zstd.
Test machine: 'Compressionette' - i5-7200U, DDR4.