Review compression concept #37

Closed
fako1024 opened this issue Aug 15, 2022 · 3 comments · Fixed by #40
Comments

fako1024 (Collaborator) commented Aug 15, 2022

Since we are planning to perform a conversion step for v4 anyway it could be reasonable to have another look at the current compression concept, e.g.

  • Consider switching to the (newest) default LZ4 build (allows direct linking against the system-wide liblz4), which might have received significant updates (including performance-related ones that might outweigh the benefits of this specific build)
  • Create dictionaries (as supported by LZ4, similar to ZStandard) for columns with low cardinality (e.g. Ports, Proto) to speed up their (de)compression
  • Consider skipping compression for columns with high entropy, e.g. bytes / packets (although at typical LZ4 read speeds it might not matter speed-wise)
  • Use zero-allocation compression (in-place) or at least explicitly re-use memory buffers
@fako1024 fako1024 added the feature New feature or request label Aug 15, 2022
@fako1024 fako1024 added this to the v4 Release milestone Aug 15, 2022
@fako1024 fako1024 self-assigned this Aug 15, 2022
fako1024 (Collaborator, Author) commented:

Important note: The recent LZ4 release (https://github.com/lz4/lz4/releases/tag/v1.9.4) brought some generic performance improvements and, more importantly, added an option to skip the CRC check during decompression (the main reason we opted for the customized version in this package). So it might be worthwhile to simply link against the system-level liblz4, drop the customized (and outdated) version entirely without losing performance (chances are it'll even be faster), and cover the migration in the conversion tool.

els0r (Owner) commented Aug 18, 2022

Then we could amend the Encoder enum: rename the current LZ4 to LZ4Custom and introduce another enum value, LZ4.

For the conversion tool there are two scenarios then:

  1. A <=v3 goDB is converted to the v4 format: the system LZ4 library (v1.9.4) is used
  2. A v4 block is converted to LZ4 every time the conversion tool reads LZ4Custom from the encoder bit
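The two scenarios above can be sketched as follows. Names and numeric values here are illustrative assumptions, not goDB's actual definitions:

```go
package main

import "fmt"

// Encoding mirrors the Encoder enum idea from the discussion: the value
// stored in a block's encoder bit selects the codec used for that block.
type Encoding uint8

const (
	EncodingNull      Encoding = iota // no compression
	EncodingLZ4Custom                 // legacy customized LZ4 build (<=v3 goDB)
	EncodingLZ4                       // system liblz4 (>=1.9.4)
)

// migrate implements scenario 2: whenever the conversion tool reads a
// block marked LZ4Custom, it re-encodes the block and stamps it as LZ4.
// All other encodings pass through unchanged.
func migrate(e Encoding) Encoding {
	if e == EncodingLZ4Custom {
		return EncodingLZ4
	}
	return e
}

func main() {
	fmt.Println(migrate(EncodingLZ4Custom) == EncodingLZ4) // true
}
```

Keeping LZ4Custom as a distinct enum value (rather than reusing LZ4) is what lets the tool distinguish old blocks from new ones during the intermediate release.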

fako1024 (Collaborator, Author) commented:

Indeed, and after an intermediate release we can drop LZ4Custom completely for the v4 release (otherwise we can't get rid of the code, of course). As long as the conversion tool can be built from a well-defined release / tag, everyone can perform this conversion by simply building the tool from that branch.

els0r pushed a commit that referenced this issue Jan 3, 2023
At the moment, this downloads and builds lz4 1.9.4 and links against it
during the build process.

It does not (yet) install the library on the system.
els0r pushed a commit that referenced this issue Jan 5, 2023
Makes use of LZ4F functions for compression and decompression. This
slightly increases the size of goDB, but gains decompression performance,
now using lz4 >=1.9.4.

This commit introduces a _hard_ dependency on lz4 1.9.4
@els0r els0r linked a pull request Jan 5, 2023 that will close this issue
els0r pushed a commit that referenced this issue Jan 18, 2023
Internally changes the compress bound handling for lz4 compression.
It now uses the library's provided API function LZ4F_compressFrameBound()
to determine the worst-case size of the destination buffer.

Also adds a benchmark to compare the compression speed for different lz4
compression levels
els0r pushed a commit that referenced this issue Jan 18, 2023
Allows setting the encoder for high-cardinality columns (currently
defaults to `NullEncoder`).

Also optimizes the case when Null-encoded data is read from a block
els0r pushed a commit that referenced this issue Jan 20, 2023
Prints human-readable error upon decompression failure