
Streaming Decompressor v4 #59

Merged: 1 commit merged into gendx:master from streaming-decompressor-v4 on Nov 30, 2020

Conversation

@cccs-sadugas (Contributor) commented Oct 5, 2020

Pull Request Overview

Changes since last PR:

  • removed Yoda-style if-statements to improve readability
  • removed do-while-style logic in favor of a more readable loop
  • removed the confusing RangeDecoder::remaining function
  • removed an unnecessary clippy lint
  • removed range checking when casting to u64
  • renamed Mode to ProcessingMode and added a ProcessingStatus instead of using a bool (see the sketch after this list)
  • renamed tmp to partial_input_buf to make its purpose clearer
  • fixed indentation on multi-line byte strings
  • made RangeDecoder members public instead of having getters
  • added more documentation explaining why unpacked_size_write_none_to_header_and_use_provided_on_read causes a WriteZero error
  • added a test utility function for reading a file into a Vec<u8>
  • added a comment explaining why MAX_REQUIRED_INPUT is 20 bytes
  • added a utility function, read_partial_input_buf
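
As context for the Mode-to-ProcessingMode rename above, here is a minimal sketch of how the two types could be shaped; the variant names are illustrative assumptions, not copied from the PR:

```rust
/// Tells the decompressor if we should expect more data after parsing the
/// current input (doc comment as it appears in the PR diff).
#[derive(Debug, PartialEq)]
enum ProcessingMode {
    /// The input reader contains the complete compressed stream.
    Finish,
    /// More chunks may follow, so the decoder may need to buffer partial input.
    Partial,
}

/// Replaces the plain `bool` previously returned by the processing loop.
#[derive(Debug, PartialEq)]
enum ProcessingStatus {
    Continue,
    Finished,
}
```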

Testing Strategy

This pull request was tested by:

  • adding relevant unit tests
  • adding relevant end-to-end tests (on .lzma, .lzma2, and .xz files)

Supporting Documentation and References

Link to previous PR: #58

@cccs-sadugas mentioned this pull request on Oct 5, 2020
@cccs-sadugas force-pushed the streaming-decompressor-v4 branch 2 times, most recently from 7834b5d to 6406291 on October 5, 2020
Create a new struct `Stream` that uses the `std::io::Write` interface
to read chunks of compressed data and write them to an output sink.

Add a streaming mode so processing can work with streaming chunks of
data. This is required because process() assumed the input reader
contained a complete stream.

Update flags and try_process_next() were added to handle the case where
the decompressor requests more input bytes than are available. Data is
temporarily buffered in the DecoderState if more input bytes are
required to make progress.

This commit also adds utility functions to the rangecoder for working
with streaming data.

Add an allow_incomplete option to disable end-of-stream checks when
calling `finish()` on a stream, since some users may want to retrieve
partially decompressed data.
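
For reference, a minimal usage sketch of the interface this commit introduces, modelled on the test code quoted later in this thread; the chunk size and the use of unwrap() are illustrative:

```rust
use std::io::Write;

// Decompress an in-memory .lzma input by feeding it to the streaming
// decompressor in fixed-size chunks and collecting the output in a Vec<u8>.
fn decompress_in_chunks(compressed: &[u8]) -> Vec<u8> {
    let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
    for chunk in compressed.chunks(1024) {
        stream.write_all(chunk).unwrap();
    }
    // finish() runs the end-of-stream checks and hands back the output sink.
    stream.finish().unwrap()
}
```
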
// Fill as much of the tmp buffer as possible
let start = self.partial_input_buf.position() as usize;
let bytes_read =
    rangecoder.read_into(&mut self.partial_input_buf.get_mut()[start..])? as u64;

cccs-sadugas (Contributor, Author) commented:

CI was failing because Rust 1.32 doesn't have std::convert::TryFrom. I think the cast to u64 is safe here because of the invariant bytes_read <= MAX_REQUIRED_INPUT.
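
A standalone illustration of that invariant-based reasoning (this is not the PR's code; MAX_REQUIRED_INPUT's value of 20 is taken from the changelog above):

```rust
const MAX_REQUIRED_INPUT: usize = 20;

// With the value bounded by MAX_REQUIRED_INPUT, the plain `as u64` widening
// cast cannot lose information, so TryFrom is not needed here.
fn to_u64(bytes_read: usize) -> u64 {
    debug_assert!(bytes_read <= MAX_REQUIRED_INPUT);
    bytes_read as u64
}
```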

gendx (Owner) commented:

Looks good to me.

@cccs-sadugas (Contributor, Author) commented:

Benchmark results:

# master
test compress_65536                  ... bench:   1,902,882 ns/iter (+/- 526,492)
test compress_empty                  ... bench:       1,339 ns/iter (+/- 166)
test compress_hello                  ... bench:       1,783 ns/iter (+/- 217)
test decompress_after_compress_65536 ... bench:   1,824,803 ns/iter (+/- 416,725)
test decompress_after_compress_empty ... bench:       3,415 ns/iter (+/- 892)
test decompress_after_compress_hello ... bench:       3,913 ns/iter (+/- 660)
test decompress_big_file             ... bench:   6,070,123 ns/iter (+/- 657,337)
test decompress_huge_dict            ... bench:       3,853 ns/iter (+/- 453)

# branch with default features
test compress_65536                  ... bench:   1,843,732 ns/iter (+/- 166,190)
test compress_empty                  ... bench:       1,335 ns/iter (+/- 156)
test compress_hello                  ... bench:       1,749 ns/iter (+/- 188)
test decompress_after_compress_65536 ... bench:   2,049,825 ns/iter (+/- 394,655)
test decompress_after_compress_empty ... bench:       3,799 ns/iter (+/- 1,432)
test decompress_after_compress_hello ... bench:       4,451 ns/iter (+/- 2,081)
test decompress_big_file             ... bench:   6,201,524 ns/iter (+/- 697,083)
test decompress_huge_dict            ... bench:       3,869 ns/iter (+/- 546)

# branch with all features
test compress_65536                  ... bench:   2,440,331 ns/iter (+/- 188,993)
test compress_empty                  ... bench:       1,538 ns/iter (+/- 420)
test compress_hello                  ... bench:       2,210 ns/iter (+/- 647)
test decompress_after_compress_65536 ... bench:   3,717,059 ns/iter (+/- 608,489)
test decompress_after_compress_empty ... bench:       3,613 ns/iter (+/- 640)
test decompress_after_compress_hello ... bench:       4,572 ns/iter (+/- 458)
test decompress_big_file             ... bench:   7,830,967 ns/iter (+/- 875,957)
test decompress_huge_dict            ... bench:       4,578 ns/iter (+/- 648)
test decompress_stream_big_file      ... bench:   7,470,316 ns/iter (+/- 1,367,116)

@gendx (Owner) commented Oct 15, 2020

I'll re-review the changes, and perform benchmarks locally to confirm the performance impact.

From your results, it seems that there could be a slight performance regression even in non-streaming mode, maybe due to the changes in the range coder (update boolean parameter), in which case it could be worth putting that behind the stream feature as well to avoid the regression. If the regression is confirmed but the culprit is not the range coder, producing a flamegraph might be useful as well to see where the regression may come from.

I hope to get back with a more complete reply in the next few days.

@GreenReaper commented:

@gendx: Do you have time to revisit this? ruffle-rs/ruffle#405 (comment) indicates a need for streaming support prior to inclusion. Ruffle is getting more attention due to next month's demise of the Flash browser plugin. There isn't that much lzma-encoded Flash (gzip remained the default), but it exists, and it would be nice for Ruffle to handle it, as well as helping the projects mentioned in #10.

@cccs-sadugas (Contributor, Author) commented:

> I'll re-review the changes, and perform benchmarks locally to confirm the performance impact.
>
> From your results, it seems that there could be a slight performance regression even in non-streaming mode, maybe due to the changes in the range coder (update boolean parameter), in which case it could be worth putting that behind the stream feature as well to avoid the regression. If the regression is confirmed but the culprit is not the range coder, producing a flamegraph might be useful as well to see where the regression may come from.
>
> I hope to get back with a more complete reply in the next few days.

@gendx any updates on this pull request?

I agree that the slight performance regression may be caused by the changes in the process function and the update flag, since these create additional code branches.

The first version of my pull request avoided this issue by writing two versions of the process and range coder functions, at the cost of code duplication, e.g. parse_reverse_bit_tree with &mut self and parse_reverse_bit_tree_check with &self.

I ran the benchmarks once again today. They show a larger 8-12% difference for the smaller tests and a lower 3% difference for the big-file test; larger inputs may fare relatively better because the initial setup cost is amortized.

# master compared with this branch (default features)
 name                             master ns/iter  branch-default ns/iter  diff ns/iter  diff %  speedup 
 compress_65536                   1,889,807       1,806,011                    -83,796  -4.43%   x 1.05 
 compress_empty                   1,305           1,259                            -46  -3.52%   x 1.04 
 compress_hello                   1,707           1,710                              3   0.18%   x 1.00 
 decompress_after_compress_65536  1,695,077       1,896,378                    201,301  11.88%   x 0.89 
 decompress_after_compress_empty  3,010           3,074                             64   2.13%   x 0.98 
 decompress_after_compress_hello  3,540           3,817                            277   7.82%   x 0.93 
 decompress_big_file              5,950,045       6,104,193                    154,148   2.59%   x 0.97 
 decompress_huge_dict             3,670           3,672                              2   0.05%   x 1.00 
# this branch (default features) compared with all features
 name                             branch-default ns/iter  branch-all-features ns/iter  diff ns/iter  diff %  speedup 
 compress_65536                   1,806,011               1,846,704                          40,693   2.25%   x 0.98 
 compress_empty                   1,259                   1,309                                  50   3.97%   x 0.96 
 compress_hello                   1,710                   1,703                                  -7  -0.41%   x 1.00 
 decompress_after_compress_65536  1,896,378               1,874,422                         -21,956  -1.16%   x 1.01 
 decompress_after_compress_empty  3,074                   3,023                                 -51  -1.66%   x 1.02 
 decompress_after_compress_hello  3,817                   3,560                                -257  -6.73%   x 1.07 
 decompress_big_file              6,104,193               6,149,868                          45,675   0.75%   x 0.99 
 decompress_huge_dict             3,672                   3,647                                 -25  -0.68%   x 1.01 
WARNING: benchmarks in new but not in old: decompress_stream_big_file
# additional bench in this branch with all features
test decompress_stream_big_file      ... bench:   5,936,301 ns/iter (+/- 484,516)

I generated flamegraphs and attached them to this thread. N is the number of test iterations.

+-----------------+---------+----------------------------------+------+
|      name       | branch  |               test               |  N   |
+-----------------+---------+----------------------------------+------+
| master.svg      | master  | decompress_big_file foo.txt.lzma | 1000 |
| master-dac.svg  | master  | decompress_after_compress_65536  |  100 |
| control.svg     | current | decompress_big_file foo.txt.lzma | 1000 |
| control-dac.svg | current | decompress_after_compress_65536  |  100 |
+-----------------+---------+----------------------------------+------+

Flamegraph analysis:

  • diff.svg shows the difference between master and control: +3% overall, mainly in read.
  • diff-dac.svg shows the difference between master-dac and control-dac, with +40% more time spent in decode_bit.

I tried some micro-optimizations by moving branches around, but overall this did not yield a significant performance improvement. On the positive side, the streaming decompressor is more CPU- and memory-efficient :).

flamegraph.zip

@gendx (Owner) left a review:

Sorry again for the delay in reviewing this.

Overall this should be good to go as an experimental streaming feature. I left some minor comments for reference, which can be fixed separately after merging.

Regarding performance, the minor regression is not that large, at least when the feature is off. But further PRs are welcome to improve performance if needed.

However, I'd like to experiment with an async implementation in the future, e.g. over futures::io::AsyncRead and futures::io::AsyncWrite, which should provide a more seamless way of streaming without having to implement the state machine for buffering within lzma-rs, and in particular without having to have this update boolean passed throughout the implementation.

Other improvements could be to make RangeDecoder a trait and integrate the update bit into it (i.e. having an UpdatingRangeDecoder and an ObservingRangeDecoder that implement the RangeDecoder trait), or to leverage const generics to store the update bit as a const-generic boolean.
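
A rough sketch of the trait direction described above; the names and method shape are illustrative, not the crate's actual API:

```rust
use std::io;

// Capture the update-vs-observe distinction in the type instead of threading
// a boolean flag through every call.
trait RangeDecode {
    /// Decode one bit against the given probability slot.
    fn decode_bit(&mut self, prob: &mut u16) -> io::Result<bool>;
}

struct UpdatingRangeDecoder { /* range, code, input reader, ... */ }
struct ObservingRangeDecoder { /* same state, but read-only probabilities */ }

impl RangeDecode for UpdatingRangeDecoder {
    fn decode_bit(&mut self, prob: &mut u16) -> io::Result<bool> {
        // would renormalize the range and update `prob`
        unimplemented!()
    }
}

impl RangeDecode for ObservingRangeDecoder {
    fn decode_bit(&mut self, _prob: &mut u16) -> io::Result<bool> {
        // same arithmetic, but leaves the probability model untouched
        unimplemented!()
    }
}
```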

use std::io::Write;

/// Utility function to read a file into memory
fn read_to_file(filename: &str) -> std::io::Result<Vec<u8>> {

gendx (Owner) commented:

This would be better named read_from_file or read_all_file.

let n = compfile.read(&mut tmp).unwrap();
stream.write_all(&tmp[0..n]).unwrap();

if n == 0 {

gendx (Owner) commented:

The break should come right after the read() call.
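
A sketch of the suggested reordering, shown here as a standalone helper (names follow the test snippet above):

```rust
use std::io::{Read, Write};

// Stop as soon as read() reports end of input, before attempting a write.
fn copy_in_chunks(mut compfile: impl Read, stream: &mut impl Write) -> std::io::Result<()> {
    let mut tmp = [0u8; 1024];
    loop {
        let n = compfile.read(&mut tmp)?;
        if n == 0 {
            break;
        }
        stream.write_all(&tmp[..n])?;
    }
    Ok(())
}
```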

}
}

fn assert_decomp_eq(input: &[u8], expected: &[u8]) {

gendx (Owner) commented:

There's some code duplication with the previous function.

/// Tells the decompressor if we should expect more data after parsing the
/// current input.
#[derive(Debug, PartialEq)]
pub enum ProcessingMode {

gendx (Owner) commented:

This shouldn't be pub.

let props = input
    .read_u8()
    .map_err(|e| error::Error::LZMAError(format!("LZMA header too short: {}", e)))?;
let props = input.read_u8().map_err(error::Error::HeaderTooShort)?;

gendx (Owner) commented:

Does it work when streaming in chunks of 1 byte?

let mut stream = lzma_rs::decompress::Stream::new(Vec::new());

// read file in chunks
let mut tmp = [0u8; 1024];

gendx (Owner) commented:

This should test chunks of 1 byte as well.
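
A hedged sketch of what a 1-byte-chunk variant could look like (the helper name is hypothetical, and finish() is assumed to hand back the output sink as in the existing test):

```rust
use std::io::Write;

// Feed the compressed data one byte at a time to exercise the
// partial-input buffering paths of the streaming decoder.
fn decompress_one_byte_at_a_time(compressed: &[u8]) -> Vec<u8> {
    let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
    for byte in compressed {
        stream.write_all(std::slice::from_ref(byte)).unwrap();
    }
    stream.finish().unwrap()
}
```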

&& (self.partial_input_buf.position() as usize) < MAX_REQUIRED_INPUT
&& self
    .try_process_next(
        &tmp[0..self.partial_input_buf.position() as usize],

gendx (Owner) commented:

Unneeded zero at the beginning of the range. Same below.

// Fill as much of the tmp buffer as possible
let start = self.partial_input_buf.position() as usize;
let bytes_read =
    rangecoder.read_into(&mut self.partial_input_buf.get_mut()[start..])? as u64;

gendx (Owner) commented:

Looks good to me.

Ok(())
}

pub fn process_mode<'a, R: io::BufRead>(

gendx (Owner) commented:

This shouldn't be pub.

@gendx merged commit e18d466 into gendx:master on Nov 30, 2020

@cccs-sadugas (Contributor, Author) commented:

> Sorry again for the delay in reviewing this.
>
> Overall this should be good to go as an experimental streaming feature. I left some minor comments for reference, which can be fixed separately after merging.
>
> Regarding performance, the minor regression is not that large, at least when the feature is off. But further PRs are welcome to improve performance if needed.
>
> However, I'd like to experiment with an async implementation in the future, e.g. over futures::io::AsyncRead and futures::io::AsyncWrite, which should provide a more seamless way of streaming without having to implement the state machine for buffering within lzma-rs, and in particular without having to have this update boolean passed throughout the implementation.
>
> Other improvements could be to make RangeDecoder a trait and integrate the update bit into it (i.e. having an UpdatingRangeDecoder and an ObservingRangeDecoder that implement the RangeDecoder trait), or to leverage const generics to store the update bit as a const-generic boolean.

Thanks for the review and merge :). I will look at the comments in the near future. Is there anything else required to publish the added features to crates.io?

I agree that the async traits could be useful and are worth looking into. Some projects using this crate may not have an async runtime, or may depend on the std::io::Write interface. It may be worth looking for a way to wrap the async interface behind std::io::Write for those users.
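
A rough sketch of the kind of wrapper hinted at here, assuming the futures crate's blocking executor; this is entirely hypothetical and not part of this PR:

```rust
use futures::executor::block_on;
use futures::io::{AsyncWrite, AsyncWriteExt};
use std::io;

// Hypothetical adapter: drives an AsyncWrite with a blocking executor so it
// can be handed to code that only understands std::io::Write.
struct BlockingWriter<W>(W);

impl<W: AsyncWrite + Unpin> io::Write for BlockingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        block_on(self.0.write(buf))
    }

    fn flush(&mut self) -> io::Result<()> {
        block_on(self.0.flush())
    }
}
```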

@gendx (Owner) commented Dec 3, 2020

> Thanks for the review and merge :). I will look at the comments in the near future. Is there anything else required to publish the added features to crates.io?

Thanks a lot for your patience as well, and sorry again for the reviewing delay; this was a big change and the last few months have been quite busy on my end.
I'll give it a few more days before publishing a new version. In particular, it would be good to look at how this PR affects #61 before publishing anything.

> I agree that the async traits could be useful and are worth looking into. Some projects using this crate may not have an async runtime, or may depend on the std::io::Write interface. It may be worth looking for a way to wrap the async interface behind std::io::Write for those users.

The plan would be to have 3 implementations:

  1. The main, blocking one.
  2. The experimental streaming interface introduced here.
  3. Another experimental async interface.

From a maintenance point of view, keeping only one of (2) and (3) would be preferable in the long term (depending on actual adoption), but on the other hand it's indeed not always possible or easy to have users pass an AsyncWrite object when that requires providing a fully-fledged async runtime (e.g. in embedded systems? in wasm?).

gendx added a commit that referenced this pull request Dec 12, 2020
@gendx mentioned this pull request on Dec 12, 2020

@gendx (Owner) commented Dec 12, 2020

> Thanks for the review and merge :). I will look at the comments in the near future. Is there anything else required to publish the added features to crates.io?

For now, the streaming implementation doesn't work for some combination(s) of the test cases and the streaming chunk size, as evidenced by #63, so fixing that is still required before publishing a new release on crates.io.

gendx added a commit that referenced this pull request Jan 28, 2021
gendx added a commit that referenced this pull request Jan 30, 2021