btlazy2 strategy is incredibly slow on highly repetitive data #100

Closed
benrg opened this issue Dec 19, 2015 · 7 comments

Comments

@benrg

benrg commented Dec 19, 2015

For example, on a file containing 10,000 repetitions of "All work and no play makes Jack a dull boy.\n" (440,000 bytes total), zstd -b15 gives about 23 MB/s on my laptop while zstd -b16 and higher give about 0.02 MB/s. I had to add another digit to the speed output to see anything but 0.0. I assume the switch to the btlazy2 strategy is what makes the difference.
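For reference, a minimal script to regenerate the input described above (the output filename is illustrative):

```python
# Recreate the test input: 10,000 repetitions of a 44-byte line,
# 440,000 bytes total. "jack.txt" is just an illustrative name.
line = b"All work and no play makes Jack a dull boy.\n"
data = line * 10_000

assert len(line) == 44
assert len(data) == 440_000

with open("jack.txt", "wb") as f:
    f.write(data)
```

Running `zstd -b15 jack.txt` versus `zstd -b16 jack.txt` on this file then shows the speed cliff between the two strategies.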

@Cyan4973
Contributor

Yes, I suspect repeating patterns have devastating effects on the btlazy2 strategy.

I'm not sure there is a simple, non-detrimental solution to this.
Repeating patterns are not that common, except for the trivial case of a single repeating character (which is already taken care of). Complex solutions take a toll on every operation and end up making btlazy2 slower for the general use case.

That being said, I'm open to any suggestion / patch.

@tomByrer

Sometimes JSON, XML/HTML/SVG, and simple array data (like templated NoSQL blobs, maps in games, etc.) can be repetitive.

My only suggestion: have a separate mode that is better at repeated patterns.
Phase 1: the repeated-pattern mode is switched on manually via a CLI flag. Someone who doesn't care about compression time could write a CLI script to compare with vs. without, and throw out the worse compression.
Phase 2: the above comparison could be done internally via a second switch.
Phase 3: the comparison could be done on smaller segments.
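The Phase 1/2 idea can be sketched in a few lines. This is a toy model using zlib levels from the Python standard library as a stand-in for the hypothetical repeated-pattern switch; `best_of` and the level choices are illustrative, not part of any real API:

```python
import zlib

def best_of(data: bytes, levels=(1, 9)) -> bytes:
    """Compress with each candidate setting; keep the smallest result."""
    return min((zlib.compress(data, level) for level in levels), key=len)

sample = b"All work and no play makes Jack a dull boy.\n" * 1_000
best = best_of(sample)

# The winner is never larger than either individual attempt,
# and it round-trips correctly.
assert len(best) <= min(len(zlib.compress(sample, 1)),
                        len(zlib.compress(sample, 9)))
assert zlib.decompress(best) == sample
```

Phase 3 would amount to calling `best_of` per segment instead of once for the whole input, at the cost of per-segment framing overhead.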

@KrzysFR
Contributor

KrzysFR commented Dec 19, 2015

Couldn't repetition in JSON/XML be addressed by good dictionary support, leaving only the core content to be handled by a more suitable algorithm?
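As a toy illustration of this point, here is preset-dictionary compression using zlib (chosen only because it ships with Python; zstd has its own, separate dictionary mechanism). The `template`/`doc` strings are invented examples: the dictionary holds the templated boilerplate, so even a short document can reference it from the very first byte.

```python
import zlib

# Hypothetical templated record and a concrete document following it.
template = b'{"name": "", "age": 0, "tags": []}'
doc = b'{"name": "Jack", "age": 37, "tags": ["dull"]}'

# Compress with the template as a preset dictionary...
comp = zlib.compressobj(level=9, zdict=template)
packed = comp.compress(doc) + comp.flush()

# ...and decompress with the same dictionary on the other side.
decomp = zlib.decompressobj(zdict=template)
assert decomp.decompress(packed) == doc
```

The catch, relevant to this issue, is that both sides must agree on the dictionary out of band; it helps with boilerplate shared across documents, not with long runs repeated inside one stream.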

@benrg
Author

benrg commented Dec 19, 2015

I originally noticed this while trying to compress Windows\Logs\CBS\CBS.log, so it does occur with some real-world data. It makes the high compression levels DoS-able, which seems like a cause for concern. It's tricky for clients to avoid this right now, because the maximum non-btlazy2 compression level depends on the input size hint and the library version.

Is the problem that the hash chains grow linearly, so the total search time is quadratic? It's not obvious to me how to avoid that, but other libraries (such as the LZMA SDK) do somehow avoid it in their maximum-compression modes with large windows.

@Cyan4973
Contributor

The problem is limited to btlazy2, as it uses a binary tree.
The hash-chain method, used within lazy2, is less affected: it pays a heavy price during search, but subsequent insertions are fast, and the dangerous section is quickly skipped since it compresses well.
By contrast, the binary tree pays a heavy price during search _and_ insertion.

Let's keep this issue open. A solution will be needed to handle such cases gracefully, without impacting too much the more general situation where repetitive data is either absent or present in limited proportion.
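The asymmetry can be shown with a toy model (not zstd's actual code) that indexes each position by its next 4 bytes. On data with period 44 there are only 44 distinct 4-byte prefixes, so every bucket grows linearly with position, and any structure that must walk existing entries on *insertion* does quadratic total work:

```python
from collections import defaultdict

data = b"All work and no play makes Jack a dull boy.\n" * 200
buckets = defaultdict(list)

hc_cost = 0  # hash chain: insertion is a prepend, O(1) per position
bt_cost = 0  # binary tree: insertion walks existing entries; on
             # degenerate (equal-prefix) data the tree is a list

for pos in range(len(data) - 4):
    key = data[pos:pos + 4]
    hc_cost += 1
    bt_cost += 1 + len(buckets[key])  # worst-case degenerate walk
    buckets[key].append(pos)

assert len(buckets) == 44      # one bucket per period offset
assert bt_cost > 50 * hc_cost  # quadratic vs. linear insertion work
```

This matches the description above: the hash chain only pays the large cost while searching, and lazy2 skips through the region quickly once it compresses well, whereas the tree pays it again on every insertion.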

@Cyan4973
Contributor

There's a new update in the "dev" branch (https://github.com/Cyan4973/zstd/tree/dev) which specifically targets repetitive data with the btlazy2 (strongest) modes.

In my tests, it dramatically improves speed in the presence of repetitive segments of any period.

The cost is pretty small: normal data tends to compress slightly less, but I expect this difference to be negligible in most circumstances. The good news is that speed is not worsened for normal data.

Feel free to test it; this is still experimental.

@Cyan4973
Contributor

Cyan4973 commented Jan 9, 2016

Merged into master

@Cyan4973 Cyan4973 closed this as completed Jan 9, 2016