
Streaming Decompressor #51

Closed

Conversation

Contributor

@cccs-sadugas cccs-sadugas commented Jul 13, 2020

Pull Request Overview

This pull request implements issue #10.

It currently depends on #50.

Testing Strategy

This pull request was tested by:

  • Adding relevant unit tests.
  • Adding relevant end-to-end tests (on .lzma, .lzma2, and .xz files).

Supporting Documentation and References

This implementation is based on the LzmaDec_TryDummy function in libhtp's port of the LZMA SDK.

TODO

This could be hidden behind a feature flag upon request.

Benchmarks

I added a benchmark, decompress_stream_big_file, which shows that streaming decompression is actually faster. This could be related to better memory management.

CPU: Intel(R) Core(TM) i7-1065G7 @ 1.30 GHz
RAM: 16 GB @ 4267 MHz
OS: Ubuntu 20.04 LTS (Docker on Windows)

master

test compress_65536                  ... bench:   2,689,440 ns/iter (+/- 667,102)
test compress_empty                  ... bench:       1,629 ns/iter (+/- 346)
test compress_hello                  ... bench:       2,278 ns/iter (+/- 430)
test decompress_after_compress_65536 ... bench:   3,655,315 ns/iter (+/- 6,067,710)
test decompress_after_compress_empty ... bench:       3,977 ns/iter (+/- 1,797)
test decompress_after_compress_hello ... bench:       4,633 ns/iter (+/- 1,811)
test decompress_big_file             ... bench:   7,079,015 ns/iter (+/- 2,228,350)
test decompress_huge_dict            ... bench:       5,359 ns/iter (+/- 4,065)

streaming-decompressor

test compress_65536                  ... bench:   2,746,007 ns/iter (+/- 1,179,775)
test compress_empty                  ... bench:       1,618 ns/iter (+/- 1,428)
test compress_hello                  ... bench:       3,729 ns/iter (+/- 4,189)
test decompress_after_compress_65536 ... bench:   3,134,650 ns/iter (+/- 2,263,524)
test decompress_after_compress_empty ... bench:       3,747 ns/iter (+/- 1,575)
test decompress_after_compress_hello ... bench:       4,558 ns/iter (+/- 1,288)
test decompress_big_file             ... bench:   7,212,875 ns/iter (+/- 2,376,085)
test decompress_huge_dict            ... bench:       4,633 ns/iter (+/- 7,197)
test decompress_stream_big_file      ... bench:   6,670,270 ns/iter (+/- 2,605,148)

Adds a memlimit configuration option for decompression. If the dict buffer's memory limit is exceeded, decompression will fail with an LZMAError. Additional functions were added to reduce the amount of breaking changes in the library.

Moves the memlimit check so that it only occurs when resizing the buffer.

Changes the `LZBuffer` trait to consume the output sink instead of holding a reference to it. This makes it easier to store the sink and avoids self-referential structs. It also makes sense for the buffer to own the sink while it is performing decompression. This also adds the methods `get_ref` and `get_mut` to access the output sink.

Adds a streaming mode so processing can work with streaming chunks of data. This is required because process() assumed the input reader contained a complete stream. A CheckState, check() methods, and try_process_next() were added to handle the case where the decompressor requests more input bytes than are available. Data is temporarily buffered in the DecoderState if more input bytes are required to make progress. This commit also adds utility functions to the rangecoder for working with streaming data.

Creates a new struct `Stream` that uses the `std::io::Write` interface to read chunks of compressed data and write them to an output sink.

Adds an option to disable end-of-stream checks when calling `finish()` on a stream, since some users may want to retrieve partially decompressed data.
@cccs-sadugas changed the title from "Streaming ecompressor" to "Streaming Decompressor" on Jul 13, 2020
Owner

@gendx gendx left a comment

  • Can you rebase on HEAD now that memlimit is merged?
  • Can you add unit tests that the decompressed output is the same via streaming vs. with the direct interface?
  • Can you make sure Travis-CI passes?

Comment on lines +20 to +21
#[cfg(feature = "enable_logging")]
info!("Compressed {} -> {} bytes", x.len(), compressed.len());
Owner

You can use lzma_info! and similar macros instead of info! here and elsewhere. This handles the cfg(feature = "enable_logging") under the hood.
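Such a feature-gated macro can be sketched as follows (an illustration of the pattern only; lzma-rs's actual `lzma_info!` may be defined differently):

```rust
// Sketch: the macro hides the cfg gate so call sites stay clean.
// With the feature disabled, the call expands to nothing.
macro_rules! lzma_info {
    ($($arg:tt)*) => {
        #[cfg(feature = "enable_logging")]
        log::info!($($arg)*);
    };
}

fn main() {
    // No #[cfg(feature = "enable_logging")] needed at the call site.
    lzma_info!("Compressed {} -> {} bytes", 100, 42);
    println!("ok");
}
```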

Contributor Author

I'm not sure those macros can be imported from the tests folder. This code was copied from tests/lzma.rs.

Comment on lines +17 to +20
// Get a reference to the output sink
fn get_ref(&self) -> &W;
// Get a mutable reference to the output sink
fn get_mut(&mut self) -> &mut W;
Owner

Can you provide meaningful names, such as get_output or get_writer or get_sink? Then get_foo_mut.

Contributor Author

I went with get_output and get_output_mut.

{
stream: &'a mut W, // Output sink
Owner

These changes to the interface taking a Write rather than a &'a mut Write make sense. However, they seem decoupled from this pull request, and would be worth submitting as a separate pull request to study their effect on the benchmarks.

Contributor Author

Good idea. See PR #54.

{
_phantom: std::marker::PhantomData<W>,
Owner

A file-level `use std::marker::PhantomData` would be clearer. It will also make it easier to support #43.
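For illustration, the suggestion amounts to this (a sketch, not the PR's code; the struct and field names are placeholders):

```rust
use std::marker::PhantomData;

// With a file-level import, the field declaration shrinks from
// `std::marker::PhantomData<W>` to `PhantomData<W>`.
struct Decoder<W> {
    // W appears only in method signatures, so PhantomData records it.
    _phantom: PhantomData<W>,
}

impl<W> Decoder<W> {
    fn new() -> Self {
        Decoder { _phantom: PhantomData }
    }
}

fn main() {
    let _d: Decoder<Vec<u8>> = Decoder::new();
    // PhantomData is zero-sized, so the marker costs nothing at runtime.
    println!("size = {}", std::mem::size_of::<Decoder<Vec<u8>>>());
}
```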

@@ -92,6 +134,23 @@ where
}
}

#[inline]
pub fn decode_bit_check(&mut self, prob: u16) -> io::Result<bool> {
Owner

The original decode_bit updates the prob value. I don't understand how prob is updated in this new function.

Contributor Author

The point of this function is to avoid updating prob (i.e. it's not declared as &mut prob). Otherwise, it behaves much the same as the non-_check version.

Why not update prob while checking?

Updating probs could cause a scenario where the state is non-recoverable. I think the updated prob value is only meant to be used in the next iteration of the loop anyway, so it should be fine for checking this iteration.

See reference implementation LzmaDec_TryDummy.
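The distinction can be sketched with a simplified range decoder (not lzma_rs's actual rangecoder, and it omits normalization; the arithmetic follows the standard LZMA bit-decode step):

```rust
// `decode_bit` adapts the probability model; the `_check` variant runs
// the same comparison on copies, so the decoder state stays
// recoverable if more input turns out to be needed.
struct RangeDecoder {
    range: u32,
    code: u32,
}

impl RangeDecoder {
    // Normal decode: updates range/code and adapts `prob`.
    fn decode_bit(&mut self, prob: &mut u16) -> bool {
        let bound = (self.range >> 11) * (*prob as u32);
        if self.code < bound {
            self.range = bound;
            *prob += (2048 - *prob) >> 5; // adapt toward bit 0
            false
        } else {
            self.range -= bound;
            self.code -= bound;
            *prob -= *prob >> 5; // adapt toward bit 1
            true
        }
    }

    // Dry-run decode: same comparison, nothing mutated, so the caller
    // can bail out and retry once more input arrives.
    fn decode_bit_check(&self, prob: u16) -> bool {
        let bound = (self.range >> 11) * (prob as u32);
        self.code >= bound
    }
}

fn main() {
    let mut dec = RangeDecoder { range: 1 << 24, code: 0 };
    let mut prob: u16 = 1024;
    let checked = dec.decode_bit_check(prob); // state untouched
    let decoded = dec.decode_bit(&mut prob);  // state updated
    assert_eq!(checked, decoded);
    println!("bit = {}", decoded);
}
```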

Comment on lines +125 to +128
Err(std::io::Error::new(
std::io::ErrorKind::Other,
"failed to read header",
))
Owner

This should be an LZMAError.

Contributor

I had to implement Display for lzma_rs::error::Error. I put this in a separate PR #53.


let len = match input.fill_buf() {
Ok(val) => val,
Err(_) => {
Owner

The original I/O error shouldn't be lost. It can provide useful information to the user (network error, invalid permissions to read a file, etc.).
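The fix being asked for amounts to propagating the error instead of replacing it, e.g. with `?` (a sketch; `read_some` is a hypothetical helper, not the PR's code):

```rust
use std::io::{self, BufRead, Cursor};

// Propagate the original I/O error so callers still see the underlying
// cause (permissions, network failure, etc.).
fn read_some<R: BufRead>(input: &mut R) -> io::Result<usize> {
    let buf = input.fill_buf()?; // `?` preserves the original error
    let len = buf.len();
    input.consume(len);
    Ok(len)
}

fn main() {
    let mut input = Cursor::new(b"abc".to_vec());
    let n = read_some(&mut input).unwrap();
    assert_eq!(n, 3);
    println!("read {} bytes", n);
}
```

If the crate's own error type is needed instead, the `io::Error` can be wrapped rather than discarded, keeping it as the source.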

output: W,
mut input: &mut R,
options: &Options,
) -> Result<State<W>, (Option<State<W>>, std::io::Error)> {
Owner

(Option<State<W>>, std::io::Error) looks weird for an error type. Can you extend LZMAError for this use case?

Contributor Author

Unfortunately, that would require adding a type parameter to LZMAError, i.e. LZMAError<W>. I think a less intrusive way would be to create an error type inside stream.rs and return that:

Result<State<W>, StreamStateError<W>>
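Such an error type might look like this (a hypothetical sketch of the proposal; names and shape are illustrative, not the PR's code):

```rust
use std::io;

// Pairs the I/O error with whatever decoder state survived, so callers
// can still recover partially decompressed data.
struct State<W> {
    output: W,
}

enum StreamStateError<W> {
    Io(Option<State<W>>, io::Error),
}

// Stand-in for a decode step that fails partway through.
fn run_to_error() -> Result<State<Vec<u8>>, StreamStateError<Vec<u8>>> {
    let partial = State { output: b"partial".to_vec() };
    Err(StreamStateError::Io(
        Some(partial),
        io::Error::new(io::ErrorKind::UnexpectedEof, "more input needed"),
    ))
}

fn main() {
    match run_to_error() {
        Err(StreamStateError::Io(Some(state), err)) => {
            // The partially written sink is still accessible.
            println!("recovered {} bytes ({})", state.output.len(), err);
        }
        _ => unreachable!(),
    }
}
```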

/// Test processing all chunk sizes
#[test]
fn test_stream_chunked() {
let small_input = b"Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll";
Owner

This looks like a useful input to have as a test file, so that other unit tests can use it as well.

Contributor Author

Added a test file and included it with `include_bytes!("../../tests/files/small.txt")`, given that it's a small input.

b.iter(|| {
let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
stream.write_all(compressed).unwrap();
stream.finish().unwrap();
Owner

I believe the resulting decompressed output should be the return value of the lambda function passed to Bencher::iter, to make sure it's not optimized away. Other benchmarks don't currently do that either, but I can send a pull request to fix them accordingly.

Contributor Author

Good catch.

@cccs-sadugas
Contributor Author

Thanks :). I will address your comments and open a new PR.
