higher zstd compression level resulting in larger compressed data #3793

Open
revintec opened this issue Oct 16, 2023 · 5 comments

revintec commented Oct 16, 2023

using zstd 1.5.5, the latest version as of writing
prepare an int array (each int occupies 4 bytes, little endian):
[0,30,60,90,...], 65536 ints, 65536*4 bytes
then compress it using various compression levels (simple compression, no dict):

2023-10-17 02:14:22.792 TRACE original size: 262144
2023-10-17 02:14:22.851 TRACE level 0 204162
2023-10-17 02:14:22.852 TRACE level 1 204103
2023-10-17 02:14:22.852 TRACE level 2 204118
2023-10-17 02:14:22.853 TRACE level 3 204162
2023-10-17 02:14:22.854 TRACE level 4 204136
2023-10-17 02:14:22.856 TRACE level 5 204147
2023-10-17 02:14:22.858 TRACE level 6 204141
2023-10-17 02:14:22.860 TRACE level 7 204161
2023-10-17 02:14:22.862 TRACE level 8 204161
2023-10-17 02:14:22.863 TRACE level 9 204161
2023-10-17 02:14:22.865 TRACE level 10 204161
2023-10-17 02:14:22.868 TRACE level 11 204165
2023-10-17 02:14:22.871 TRACE level 12 204161
2023-10-17 02:14:22.877 TRACE level 13 204143
2023-10-17 02:14:22.893 TRACE level 14 83240
2023-10-17 02:14:22.907 TRACE level 15 83240
2023-10-17 02:14:22.923 TRACE level 16 83242
2023-10-17 02:14:22.940 TRACE level 17 83242
2023-10-17 02:14:22.958 TRACE level 18 142849
2023-10-17 02:14:22.976 TRACE level 19 142849
2023-10-17 02:14:22.998 TRACE level 20 142849
2023-10-17 02:14:23.017 TRACE level 21 142849
2023-10-17 02:14:23.035 TRACE level 22 142849

as seen from the above output, a higher compression level (18) starts resulting in larger compressed data

  1. is that in line with expectations? I thought a higher compression level should result in smaller compressed data, but this one is over 70% larger. how can I produce the smallest output data (ignoring compression time and/or memory consumption)?
    -- a search for "compression level size" in the issues turns up no relevant information on the first page, nor any relevant results on Google :( sorry if this has already been brought up

and there's a related question I'm putting into the same issue (forgive me :)

  2. the input data is relatively simple (low entropy), so why isn't it compressed more? is there any tweak/flag that I should enable? the original data is a tsdb timestamp series, and I'd prefer not to change it (by rearranging bytes or manually doing delta compression); is there a recommended way to handle this semi-arithmetic progression/sequence case? (n.b. the delta is not always the same, it may be 30,30,30,300,300,30,3600,3600,86400,30,30)
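
For reference, a minimal C sketch of the test described above (this is an assumed reproduction, not the harness that produced the trace), driving the one-shot `ZSTD_compress()` API over all levels:

```c
/* Hypothetical reproduction sketch: build the [0,30,60,...] array of 65536
 * int32 values and compress it once per level with the one-shot API.
 * Assumes a little-endian host so the in-memory layout matches the report. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    const size_t count   = 65536;
    const size_t srcSize = count * sizeof(int32_t);    /* 262144 bytes */
    int32_t* src = malloc(srcSize);
    for (size_t i = 0; i < count; i++)
        src[i] = (int32_t)(i * 30);                    /* 0, 30, 60, 90, ... */

    const size_t dstCapacity = ZSTD_compressBound(srcSize);
    void* dst = malloc(dstCapacity);

    printf("original size: %zu\n", srcSize);
    for (int level = 0; level <= ZSTD_maxCLevel(); level++) {   /* 0 = default */
        const size_t cSize = ZSTD_compress(dst, dstCapacity, src, srcSize, level);
        if (ZSTD_isError(cSize))
            printf("level %d error: %s\n", level, ZSTD_getErrorName(cSize));
        else
            printf("level %d %zu\n", level, cSize);
    }
    free(dst);
    free(src);
    return 0;
}
```

Built against libzstd (e.g. `cc repro.c -lzstd`), this should print one line per level, matching the shape of the trace above.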
@Cyan4973 (Contributor) commented

This can happen, though this is not the expectation.

All these algorithms make bets, even the strongest ones.
They bet on a certain distribution of statistics, a certain probability of events appearing in the future, that may turn out to be wrong.

We already know that these bets can be weaponized to make them fail. It is more likely to happen in presence of synthetic data, which tend to be unrepresentative of real-world scenarios. But of course, from time to time, real scenarios can be affected too.

The solution is far from trivial. It generally requires access to source data, in order to understand its behavior, and how it's analyzed by the algorithm. Then, sometimes, a heuristic can be tweaked to better handle this corner case. Other times it can't be, or the fix comes at the expense of another corner case.

Long term, we could imagine an even stronger (and slower) compression parser that would make fewer bets about the future, because it would have access to more information to evaluate trade-offs. Sadly it hasn't happened yet, because it's a very time-consuming task, and we don't have real production workloads which would benefit from this effort. Also, the gains are expected to be small, i.e. < 1% for general cases, excluding pathological corner cases. So this effort tends to be postponed repeatedly. It's still in the cards, and I still hope it will happen someday.

For the time being, our best option is to have a look at source data, if it's shareable, and try to learn from it.

Cyan4973 self-assigned this Oct 16, 2023
@basid-irk commented

> ver & zstd -V & zstd -b1e19 | sort /+2
Microsoft Windows [Version 6.1.7601]
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
16#Synthetic 50%     :  10000000 ->   3080678 (x3.246),   6.85 MB/s, 1769.4 MB/s
 2#Synthetic 50%     :  10000000 ->   3129137 (x3.196),  443.6 MB/s, 1810.9 MB/s
17#Synthetic 50%     :  10000000 ->   3136878 (x3.188),   2.92 MB/s, 1537.9 MB/s
19#Synthetic 50%     :  10000000 ->   3140424 (x3.184),   2.21 MB/s, 1480.5 MB/s
18#Synthetic 50%     :  10000000 ->   3145664 (x3.179),   2.67 MB/s, 1463.4 MB/s
 1#Synthetic 50%     :  10000000 ->   3154228 (x3.170),  565.7 MB/s, 1854.6 MB/s
-- nonsense in bench --
 3#Synthetic 50%     :  10000000 ->   3230847 (x3.095),  238.1 MB/s, 1461.9 MB/s
 6#Synthetic 50%     :  10000000 ->   3280418 (x3.048),   95.7 MB/s, 1200.6 MB/s
 5#Synthetic 50%     :  10000000 ->   3292183 (x3.037),  108.0 MB/s, 1178.9 MB/s
 8#Synthetic 50%     :  10000000 ->   3318556 (x3.013),   75.3 MB/s, 1001.4 MB/s
 7#Synthetic 50%     :  10000000 ->   3327348 (x3.005),   85.8 MB/s,  980.3 MB/s
 4#Synthetic 50%     :  10000000 ->   3345685 (x2.989),  186.1 MB/s, 1277.5 MB/s
15#Synthetic 50%     :  10000000 ->   3353801 (x2.982),   8.00 MB/s, 1068.8 MB/s
14#Synthetic 50%     :  10000000 ->   3354678 (x2.981),   9.75 MB/s,  749.2 MB/s
13#Synthetic 50%     :  10000000 ->   3354692 (x2.981),   9.28 MB/s, 1068.3 MB/s
 9#Synthetic 50%     :  10000000 ->   3355994 (x2.980),   54.2 MB/s,  834.1 MB/s
12#Synthetic 50%     :  10000000 ->   3362882 (x2.974),   24.1 MB/s,  608.8 MB/s
10#Synthetic 50%     :  10000000 ->   3363166 (x2.973),   35.3 MB/s, 1030.5 MB/s
11#Synthetic 50%     :  10000000 ->   3363170 (x2.973),   25.8 MB/s, 1045.0 MB/s

Cyan4973 added a commit that referenced this issue Jan 31, 2024
this works great for 32-bit arrays,
notably the synthetic ones, with extreme regularity,
unfortunately, it's not universal,
and in some cases, it's a loss.
Crucially, on average, it's a loss on silesia.
The most negatively impacted file is x-ray.
It deserves an investigation before suggesting it as an evolution.

Cyan4973 commented Mar 4, 2024

Current situation regarding this specific scenario:

 1#issue3793.bin     :    262144 ->    204103 (x1.284),  
 2#issue3793.bin     :    262144 ->    204118 (x1.284),  
 3#issue3793.bin     :    262144 ->    204162 (x1.284),  
 4#issue3793.bin     :    262144 ->    204136 (x1.284), 
 5#issue3793.bin     :    262144 ->    204147 (x1.284),   
 6#issue3793.bin     :    262144 ->    204141 (x1.284),  
 7#issue3793.bin     :    262144 ->    204161 (x1.284),  
 8#issue3793.bin     :    262144 ->    204161 (x1.284),  
 9#issue3793.bin     :    262144 ->    204161 (x1.284),  
10#issue3793.bin     :    262144 ->    204161 (x1.284),  
11#issue3793.bin     :    262144 ->    204165 (x1.284),  
12#issue3793.bin     :    262144 ->    204161 (x1.284),  
13#issue3793.bin     :    262144 ->    204143 (x1.284),  
14#issue3793.bin     :    262144 ->     83240 (x3.149),  
15#issue3793.bin     :    262144 ->     83240 (x3.149),   
16#issue3793.bin     :    262144 ->     83242 (x3.149),   
17#issue3793.bin     :    262144 ->     83242 (x3.149),   
18#issue3793.bin     :    262144 ->     82866 (x3.163),  
19#issue3793.bin     :    262144 ->     82866 (x3.163), 
20#issue3793.bin     :    262144 ->     82866 (x3.163),  
21#issue3793.bin     :    262144 ->     82866 (x3.163),   
22#issue3793.bin     :    262144 ->     82866 (x3.163),   

Compression ratio could be even better at higher compression levels, but at least performance doesn't degrade anymore.


revintec commented Mar 4, 2024

I'll provide some context here:
this extremely regular data may actually be quite "real" and "common": think about index files
I think it's currently rare (in issue reports and/or testing corpora) because current compression algorithms can't handle it well, so users give up and hand-write some delta compression themselves (it's easy to write and solves the problem quickly)

I've seen this on other occasions too, such as Elasticsearch's index files
it's also commonly pre-compressed using a delta transform ([100,130,160,190,200,210,240,...] -> [100,30,30,30,10,10,30,...]), making it easier for compression algorithms

the original data I referenced is from a tsdb timestamp series, so the values increment by 30 (seconds) for every sample, with occasional increments like 3600 (1 hour) and/or 86400 (1 day). the original code used hand-written delta compression (with varint and other tricks) but no zstd. I tried to replace the hand-written algorithm with zstd to see if I could make the code more general and easier to maintain, while still providing comparable compression benefits

the unexpected part is not (only) the compression ratio per se, but also the higher compression level resulting in a noticeably larger size. I thought that a higher level would internally try all the heuristics of all lower levels

  1. is it possible for zstd to produce smaller files for these cases? e.g. applying a new kind of the aforementioned pre-delta + varint heuristic at high compression levels
  2. does Improve compression of Arrays of Integers (High compression mode) #3895 work on data other than int arrays? e.g. long arrays or double arrays, etc.


Cyan4973 commented Mar 4, 2024

  2. does Improve compression of Arrays of Integers (High compression mode) #3895 work on data other than int arrays? e.g. long arrays or double arrays, etc.

Yes, but I believe the situation was not as bad for arrays of 64-bit values. Therefore, comparatively, arrays of 32-bit values receive a more substantial boost.

  1. is it possible for zstd to produce smaller files for these cases? e.g. applying a new kind of the aforementioned pre-delta + varint heuristic at high compression levels

Not without breaking the format.
Such preprocessing would be better provided by a higher layer.
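
As an illustration of what such a higher-layer transform could look like, here is a hedged sketch (the function names and the plain int32 delta scheme are assumptions, not code from this repository; the hand-written code described above also uses varints and other tricks): delta-encode the timestamps, feed the deltas to `ZSTD_compress()`, and reverse the transform after decompression.

```c
/* Sketch of application-level preprocessing: the zstd format is untouched,
 * only the representation handed to the compressor changes.
 * Names and the plain int32 delta scheme are illustrative assumptions. */
#include <stddef.h>
#include <stdint.h>
#include <zstd.h>

static void delta_encode(int32_t* a, size_t n)
{
    for (size_t i = n; i-- > 1; )   /* walk backwards so each delta is taken */
        a[i] -= a[i - 1];           /* against the still-unmodified predecessor */
}

static void delta_decode(int32_t* a, size_t n)
{
    for (size_t i = 1; i < n; i++)  /* running sum restores the original values */
        a[i] += a[i - 1];
}

/* Compress n timestamps; returns the compressed size (or a zstd error code).
 * The caller's array is restored before returning. */
static size_t compress_timestamps(void* dst, size_t dstCapacity,
                                  int32_t* ts, size_t n, int level)
{
    delta_encode(ts, n);   /* e.g. 100,130,160,190 -> 100,30,30,30 */
    const size_t cSize = ZSTD_compress(dst, dstCapacity, ts,
                                       n * sizeof(int32_t), level);
    delta_decode(ts, n);
    return cSize;
}
```

Decompression would run `ZSTD_decompress()` into an int32 buffer and then `delta_decode()` on the result.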

hswong3i pushed a commit to alvistack/facebook-zstd that referenced this issue Mar 27, 2024