Review compression concept #37

Closed
fako1024 opened this issue Aug 15, 2022 · 3 comments · Fixed by #40
Comments

fako1024 (Collaborator) commented Aug 15, 2022

Since we are planning to perform a conversion step for v4 anyway it could be reasonable to have another look at the current compression concept, e.g.

  • Consider switching to the (newest) default LZ4 build (allows direct linking against the system-wide liblz4), which might have received significant updates (including performance-related ones that might outweigh the benefits of this specific build)
  • Create dictionaries (as supported by LZ4, similar to ZStandard) for columns with low cardinality (e.g. Ports, Proto) to speed up their (de)compression
  • Consider skipping compression for columns with high entropy, e.g. bytes / packets (although at typical LZ4 read speeds it might not matter speed-wise)
  • Use zero-allocation compression (in-place) or at least explicitly re-use memory buffers
@fako1024 fako1024 added the feature New feature or request label Aug 15, 2022
@fako1024 fako1024 added this to the v4 Release milestone Aug 15, 2022
@fako1024 fako1024 self-assigned this Aug 15, 2022
fako1024 (Collaborator, Author) commented:

Important note: The recent LZ4 release (https://github.com/lz4/lz4/releases/tag/v1.9.4) brought some generic performance improvements and, more importantly, added an option to skip the CRC check during decompression (the main reason we opted for the customized version in this package). So it might be worthwhile to simply link against the system-level liblz4, drop the customized (and outdated) version entirely without losing performance (chances are it'll even be faster), and cover the migration in the conversion tool.

els0r (Owner) commented Aug 18, 2022

Then we could amend the Encoder enum: rename the current LZ4 to LZ4Custom and introduce another enum value, LZ4.

For the conversion tool there are two scenarios then:

  1. A <=v3 goDB is converted to the v4 format: the system LZ4 library (v1.9.4) is used
  2. A v4 block is converted to LZ4 every time the conversion tool reads LZ4Custom from the encoder bit
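The two scenarios above can be sketched as follows. Names and numeric values here are illustrative assumptions, not goDB's actual definitions:

```go
package main

import "fmt"

// Encoding mirrors the Encoder enum idea from the discussion: the value
// stored in a block's encoder bit selects the codec used for that block.
type Encoding uint8

const (
	EncodingNull      Encoding = iota // no compression
	EncodingLZ4Custom                 // legacy customized LZ4 build (<=v3 goDB)
	EncodingLZ4                       // system liblz4 (>=1.9.4)
)

// migrate implements scenario 2: whenever the conversion tool reads a
// block marked LZ4Custom, it re-encodes the block and stamps it as LZ4.
// All other encodings pass through unchanged.
func migrate(e Encoding) Encoding {
	if e == EncodingLZ4Custom {
		return EncodingLZ4
	}
	return e
}

func main() {
	fmt.Println(migrate(EncodingLZ4Custom) == EncodingLZ4) // true
}
```

Keeping LZ4Custom as a distinct enum value (rather than reusing LZ4) is what lets the tool distinguish old blocks from new ones during the intermediate release.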

fako1024 (Collaborator, Author) commented:

Indeed, and after an intermediate release we can drop LZ4Custom completely for the v4 release (otherwise we can't get rid of the code, of course). As long as the conversion tool can be built from a well-defined release / tag, everyone can perform this conversion by simply building the tool from that branch.

els0r pushed a commit that referenced this issue Jan 3, 2023
At the moment, this downloads and builds lz4 1.9.4 and links against it
during the build process.

It does not (yet) install the library on the system.
els0r pushed a commit that referenced this issue Jan 5, 2023
Makes use of LZ4F functions for compression and decompression. This
slightly increases the size of goDB, but gains decompression performance,
now using lz4 >=1.9.4.

This commit introduces a _hard_ dependency on lz4 1.9.4
@els0r els0r linked a pull request Jan 5, 2023 that will close this issue
els0r pushed a commit that referenced this issue Jan 18, 2023
Internally changes the compress bound handling for lz4 compression.
It now uses the library's provided API function LZ4F_compressFrameBound()
to determine the worst-case size of the destination buffer.

Also adds a benchmark to compare the compression speed for different lz4
compression levels
els0r pushed a commit that referenced this issue Jan 18, 2023
Allows setting the encoder for high-cardinality columns (currently
defaults to `NullEncoder`).

Also optimizes the case when Null-encoded data is read from a block
els0r pushed a commit that referenced this issue Jan 20, 2023
Prints human-readable error upon decompression failure