Replace lz4 with lz4_flex Allowing Compilation for WASM #4884

Merged 12 commits into apache:master on Oct 2, 2023

Conversation

@tustvold (Contributor) commented Oct 1, 2023

Which issue does this PR close?

Relates to apache/datafusion#7652 and apache/datafusion#7653

Rationale for this change

lz4_flex is a pure Rust implementation of lz4 that achieves similar performance to the C library, but with the benefit of being compatible with WASM

What changes are included in this PR?

Are there any user-facing changes?

No, the only changes are to experimental modules
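
For readers new to the crate, here is a minimal sketch of the lz4_flex block-format API that this change builds on (purely illustrative; not the actual codec wiring in this PR):

```rust
// Illustrative only: round-trip a raw LZ4 block with lz4_flex.
use lz4_flex::{compress_prepend_size, decompress_size_prepended};

fn main() {
    let input = b"some repetitive data some repetitive data";

    // Compress a raw LZ4 block, prepending the uncompressed length.
    let compressed = compress_prepend_size(input);

    // Decompress using the prepended length.
    let roundtripped = decompress_size_prepended(&compressed).expect("valid LZ4 block");
    assert_eq!(&roundtripped[..], &input[..]);
}
```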

@github-actions bot added the parquet (Changes to the parquet crate) label Oct 1, 2023
@tustvold changed the title from "Use lz4" to "Replace lz4 with lz4_flex Allowing Compilation for WASM" Oct 1, 2023
@@ -386,9 +386,6 @@ fn convert_csv_to_parquet(args: &Args) -> Result<(), ParquetFromCsvError> {
Compression::BROTLI(_) => {
Box::new(brotli::Decompressor::new(input_file, 0)) as Box<dyn Read>
}
Compression::LZ4 => Box::new(lz4::Decoder::new(input_file).map_err(|e| {

@tustvold (Contributor, Author) commented:
This will decode lz4 data encoded without any framing, which is so niche that I struggle to conceive of people relying on this functionality. Further, this is a utility CLI tool, so I'm not too concerned about it.

@alamb (Contributor) commented:
I agree -- we can update the CSV tool if needed.
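
(Should frame-format LZ4 support in the CSV tool ever be wanted back, a hypothetical sketch using lz4_flex's frame decoder, which requires the crate's `frame` feature, could look like the following; this is not part of the PR.)

```rust
// Hypothetical replacement for the removed match arm; not part of this PR.
// Requires lz4_flex with the `frame` feature enabled.
use std::fs::File;
use std::io::Read;

use lz4_flex::frame::FrameDecoder;

fn lz4_reader(input_file: File) -> Box<dyn Read> {
    // FrameDecoder implements Read, so it can slot into the same
    // Box<dyn Read> position the old lz4::Decoder occupied.
    Box::new(FrameDecoder::new(input_file)) as Box<dyn Read>
}
```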

@@ -383,64 +383,6 @@ impl BrotliLevel {
}
}

#[cfg(any(feature = "lz4", test))]
mod lz4_codec {

@tustvold (Contributor, Author) commented:
This codec has been replaced by LZ4HadoopCodec, so let's just remove it; it isn't used.

@alamb (Contributor) commented:
What do you mean "replaced"? Is there something in the parquet standard?

@tustvold (Contributor, Author) commented Oct 1, 2023:
#3013

Basically the standard didn't specify the framing and so the ecosystem ended up with two 😄

That PR replaced this codec with LZ4HadoopCodec, which has an automatic fallback; that is what has been used since then.
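
For anyone following along, here is a simplified sketch of the Hadoop framing and the raw-block fallback being described (this is not the crate's actual LZ4HadoopCodec code): each Hadoop chunk is a 4-byte big-endian uncompressed length, a 4-byte big-endian compressed length, and then the LZ4 block bytes; if that framing fails to parse, the whole buffer is retried as a single raw LZ4 block.

```rust
// Simplified sketch only; not the crate's actual LZ4HadoopCodec implementation.
use lz4_flex::block::decompress_into;

/// Returns the number of decompressed bytes, or None if the buffer does not
/// parse as Hadoop-framed LZ4 (the caller would then fall back to treating
/// the whole buffer as a single raw LZ4 block).
fn try_decompress_hadoop(input: &[u8], output: &mut [u8]) -> Option<usize> {
    let (mut read, mut written) = (0usize, 0usize);
    while read + 8 <= input.len() {
        let uncompressed = u32::from_be_bytes(input[read..read + 4].try_into().ok()?) as usize;
        let compressed = u32::from_be_bytes(input[read + 4..read + 8].try_into().ok()?) as usize;
        read += 8;
        if read + compressed > input.len() || written + uncompressed > output.len() {
            return None;
        }
        written += decompress_into(
            &input[read..read + compressed],
            &mut output[written..written + uncompressed],
        )
        .ok()?;
        read += compressed;
    }
    (read == input.len()).then_some(written)
}
```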

@alamb (Contributor) left a review:
The code looks good to me -- do we have any performance numbers?

Also, I don't understand the "replaced lz4 with lz4hadoopcodec" comment. I probably am missing something.

cc @sunchao who might have more context on parquet / compression formats (or know someone who does)


@alamb (Contributor) commented Oct 1, 2023:

Thank you for this @tustvold 🙏

@github-actions bot added the arrow (Changes to the arrow crate) label Oct 1, 2023

@kylebarron (Contributor) commented:
I'm excited for this! I maintain https://github.com/kylebarron/parquet-wasm, which up until now hasn't been able to support lz4 for the arrow/parquet bindings.

It might be of use to note that arrow2/parquet2 implemented support for both lz4 and lz4_flex, so that the end user could choose which to enable. jorgecarleitao/parquet2#124

This implementation is a bit slower, but it uses no unsafe and is written in native Rust, so it supports being compiled to wasm.
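
For comparison, the parquet2 approach amounts to compiling the codec against one backend or the other behind Cargo features. A rough, hypothetical sketch (the feature names and the C-backed call here are assumptions, not this crate's code):

```rust
// Hypothetical: offer both backends behind mutually exclusive Cargo features,
// in the spirit of parquet2. Feature names are illustrative.
#[cfg(feature = "lz4_flex")]
fn lz4_block_decompress(input: &[u8], output: &mut [u8]) -> usize {
    // Pure-Rust backend: compiles to wasm32.
    lz4_flex::block::decompress_into(input, output).expect("valid LZ4 block")
}

#[cfg(all(feature = "lz4", not(feature = "lz4_flex")))]
fn lz4_block_decompress(input: &[u8], output: &mut [u8]) -> usize {
    // C-backed lz4 crate (signature as of recent lz4-rs versions).
    lz4::block::decompress_to_buffer(input, Some(output.len() as i32), output)
        .expect("valid LZ4 block")
}
```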

@tustvold (Contributor, Author) commented Oct 1, 2023:

Running the benchmarks shows this does appear to regress performance:

compress LZ4 - alphanumeric
                        time:   [116.18 µs 116.44 µs 116.74 µs]
                        change: [+0.7922% +1.1707% +1.5607%] (p = 0.00 < 0.05)
                        Change within noise threshold.

LZ4 compressed 1048576 bytes of alphanumeric to 1052698 bytes
decompress LZ4 - alphanumeric
                        time:   [34.815 µs 34.839 µs 34.865 µs]
                        change: [+20.848% +21.196% +21.530%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

compress LZ4_RAW - alphanumeric
                        time:   [117.29 µs 117.50 µs 117.73 µs]
                        change: [+4.0342% +4.3202% +4.6091%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

LZ4_RAW compressed 1048576 bytes of alphanumeric to 1052690 bytes
decompress LZ4_RAW - alphanumeric
                        time:   [33.121 µs 33.139 µs 33.159 µs]
                        change: [+9.0540% +9.5426% +10.041%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

Benchmarking compress LZ4 - words: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
compress LZ4 - words    time:   [1.5822 ms 1.5831 ms 1.5840 ms]
                        change: [+14.406% +14.490% +14.573%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  10 (10.00%) high mild
  2 (2.00%) high severe

LZ4 compressed 1048576 bytes of words to 408369 bytes
decompress LZ4 - words  time:   [253.21 µs 253.31 µs 253.42 µs]
                        change: [+3.8904% +3.9696% +4.0484%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking compress LZ4_RAW - words: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
compress LZ4_RAW - words
                        time:   [1.5648 ms 1.5653 ms 1.5659 ms]
                        change: [+13.811% +13.877% +13.940%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  1 (1.00%) high severe

LZ4_RAW compressed 1048576 bytes of words to 408361 bytes
decompress LZ4_RAW - words
                        time:   [253.63 µs 253.73 µs 253.84 µs]
                        change: [+3.8637% +3.9363% +4.0044%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  5 (5.00%) high mild
  2 (2.00%) high severe

In particular, we see regressions of:

  • ~10-20% when decompressing non-compressible input
  • ~15% when compressing compressible input

These benchmarks represent two fairly extreme cases; most realistic workloads likely sit somewhere in between and would see fairly minor regressions to both decompression and compression.

This is consistent with lz4_flex's own benchmarking, which shows lz4_flex tending to perform better than lz4 at only one of compression or decompression for a given corpus.

I personally am happy enough with the performance not to feel the need to take on the complexity of maintaining two possible implementations, especially given how rarely LZ4 is used in the ecosystem (it was only properly standardised a few years ago), but I welcome other opinions.

@tustvold merged commit 3b0ede4 into apache:master on Oct 2, 2023 (31 checks passed)

@sunchao (Member) commented Oct 2, 2023:

> Also, I don't understand the "replaced lz4 with lz4hadoopcodec" comment. I probably am missing something.
> cc @sunchao who might have more context on parquet / compression formats (or know someone who does)

@alamb There are now two LZ4 compression codecs in Parquet: the old/deprecated "Hadoop" LZ4 and the new LZ4_RAW, due to the framing issue @tustvold mentioned.

There's an email thread and discussions on this: https://www.mail-archive.com/dev@parquet.apache.org/msg14529.html
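
In this crate those correspond to `Compression::LZ4` (the legacy Hadoop-framed codec, with the fallback discussed above) and `Compression::LZ4_RAW`. When writing new files, the well-specified codec can be selected explicitly via the writer properties, e.g.:

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Prefer the properly specified LZ4_RAW codec for newly written files.
fn writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_compression(Compression::LZ4_RAW)
        .build()
}
```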

@alamb (Contributor) commented Oct 2, 2023:

> There's an email thread and discussions on this: https://www.mail-archive.com/dev@parquet.apache.org/msg14529.html

> Despite several attempts by the parquet-cpp developers, we were not
> able to reach the point where LZ4-compressed Parquet files are
> bidirectionally compatible between parquet-cpp and parquet-mr. Other
> implementations are having, or have had, similar issues. My conclusion
> is that the Parquet spec doesn't allow independent reimplementation of
> the LZ4 compression format required by parquet-mr. Therefore, LZ4
> compression should be removed from the spec (possibly replaced with
> another enum value for a properly-specified, interoperable, LZ4-backed
> compression scheme).

That is about as good a rationale for removing LZ4 as I have heard.
