
Streaming Decompressor v4 #59

Merged: 1 commit merged into gendx:master from streaming-decompressor-v4 on Nov 30, 2020

Conversation

@cccs-sadugas (Contributor) commented Oct 5, 2020

Pull Request Overview

Changes since last PR:

  • removed Yoda-style if-statements to improve readability
  • removed do-while-style logic in favor of a more readable loop
  • removed the confusing RangeDecoder::remaining function
  • removed an unnecessary clippy lint
  • removed range checking when casting to u64
  • renamed Mode to ProcessingMode and added a ProcessingStatus instead of using a bool (see the sketch after this list)
  • renamed tmp to partial_input_buf to make its purpose clearer
  • fixed indentation on multi-line byte strings
  • made RangeDecoder members public instead of having getters
  • added more documentation explaining why unpacked_size_write_none_to_header_and_use_provided_on_read causes a WriteZero error
  • added a test utility function for reading a file into a Vec<u8>
  • added a comment explaining why MAX_REQUIRED_INPUT is 20 bytes
  • added a utility function, read_partial_input_buf
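
As context for the Mode-to-ProcessingMode rename above, here is a minimal sketch of how the two types could be shaped; the variant names are illustrative assumptions, not copied from the PR:

```rust
/// Tells the decompressor if we should expect more data after parsing the
/// current input (doc comment as it appears in the PR diff).
#[derive(Debug, PartialEq)]
enum ProcessingMode {
    /// The input reader contains the complete compressed stream.
    Finish,
    /// More chunks may follow, so the decoder may need to buffer partial input.
    Partial,
}

/// Replaces the plain `bool` previously returned by the processing loop.
#[derive(Debug, PartialEq)]
enum ProcessingStatus {
    Continue,
    Finished,
}
```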

Testing Strategy

This pull request was tested by:

  • adding relevant unit tests
  • adding relevant end-to-end tests (on .lzma, .lzma2, and .xz files)

Supporting Documentation and References

Link to previous PR: #58

@cccs-sadugas mentioned this pull request on Oct 5, 2020
@cccs-sadugas force-pushed the streaming-decompressor-v4 branch 2 times, most recently from 7834b5d to 6406291 on October 5, 2020
Create a new struct `Stream` that uses the `std::io::Write` interface
to read chunks of compressed data and write them to an output sink.

Add a streaming mode so processing can work with streaming chunks of
data. This is required because process() assumed the input reader
contained a complete stream.

Update flags and try_process_next() were added to handle the case where
the decompressor requests more input bytes than are available. Data is
temporarily buffered in the DecoderState if more input bytes are
required to make progress.

This commit also adds utility functions to the rangecoder for working
with streaming data.

Add an allow_incomplete option to disable end-of-stream checks when
calling `finish()` on a stream, since some users may want to retrieve
partially decompressed data.
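
For reference, a minimal usage sketch of the interface this commit introduces, modelled on the test code quoted later in this thread; the chunk size and the use of unwrap() are illustrative:

```rust
use std::io::Write;

// Decompress an in-memory .lzma input by feeding it to the streaming
// decompressor in fixed-size chunks and collecting the output in a Vec<u8>.
fn decompress_in_chunks(compressed: &[u8]) -> Vec<u8> {
    let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
    for chunk in compressed.chunks(1024) {
        stream.write_all(chunk).unwrap();
    }
    // finish() runs the end-of-stream checks and hands back the output sink.
    stream.finish().unwrap()
}
```
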
// Fill as much of the tmp buffer as possible
let start = self.partial_input_buf.position() as usize;
let bytes_read =
    rangecoder.read_into(&mut self.partial_input_buf.get_mut()[start..])? as u64;

cccs-sadugas (Contributor, Author) commented:

CI was failing because Rust 1.32 doesn't have std::convert::TryFrom. I think the cast to u64 is safe here because of the invariant bytes_read <= MAX_REQUIRED_INPUT.
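
A standalone illustration of that invariant-based reasoning (this is not the PR's code; MAX_REQUIRED_INPUT's value of 20 is taken from the changelog above):

```rust
const MAX_REQUIRED_INPUT: usize = 20;

// With the value bounded by MAX_REQUIRED_INPUT, the plain `as u64` widening
// cast cannot lose information, so TryFrom is not needed here.
fn to_u64(bytes_read: usize) -> u64 {
    debug_assert!(bytes_read <= MAX_REQUIRED_INPUT);
    bytes_read as u64
}
```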

gendx (Owner) commented:

Looks good to me.

@cccs-sadugas (Contributor, Author) commented:

Benchmark results:

# master
test compress_65536                  ... bench:   1,902,882 ns/iter (+/- 526,492)
test compress_empty                  ... bench:       1,339 ns/iter (+/- 166)
test compress_hello                  ... bench:       1,783 ns/iter (+/- 217)
test decompress_after_compress_65536 ... bench:   1,824,803 ns/iter (+/- 416,725)
test decompress_after_compress_empty ... bench:       3,415 ns/iter (+/- 892)
test decompress_after_compress_hello ... bench:       3,913 ns/iter (+/- 660)
test decompress_big_file             ... bench:   6,070,123 ns/iter (+/- 657,337)
test decompress_huge_dict            ... bench:       3,853 ns/iter (+/- 453)

# branch with default features
test compress_65536                  ... bench:   1,843,732 ns/iter (+/- 166,190)
test compress_empty                  ... bench:       1,335 ns/iter (+/- 156)
test compress_hello                  ... bench:       1,749 ns/iter (+/- 188)
test decompress_after_compress_65536 ... bench:   2,049,825 ns/iter (+/- 394,655)
test decompress_after_compress_empty ... bench:       3,799 ns/iter (+/- 1,432)
test decompress_after_compress_hello ... bench:       4,451 ns/iter (+/- 2,081)
test decompress_big_file             ... bench:   6,201,524 ns/iter (+/- 697,083)
test decompress_huge_dict            ... bench:       3,869 ns/iter (+/- 546)

# branch with all features
test compress_65536                  ... bench:   2,440,331 ns/iter (+/- 188,993)
test compress_empty                  ... bench:       1,538 ns/iter (+/- 420)
test compress_hello                  ... bench:       2,210 ns/iter (+/- 647)
test decompress_after_compress_65536 ... bench:   3,717,059 ns/iter (+/- 608,489)
test decompress_after_compress_empty ... bench:       3,613 ns/iter (+/- 640)
test decompress_after_compress_hello ... bench:       4,572 ns/iter (+/- 458)
test decompress_big_file             ... bench:   7,830,967 ns/iter (+/- 875,957)
test decompress_huge_dict            ... bench:       4,578 ns/iter (+/- 648)
test decompress_stream_big_file      ... bench:   7,470,316 ns/iter (+/- 1,367,116)

@gendx (Owner) commented Oct 15, 2020

I'll re-review the changes, and perform benchmarks locally to confirm the performance impact.

From your results, it seems that there could be a slight performance regression even in non-streaming mode, maybe due to the changes in the range coder (update boolean parameter), in which case it could be worth putting that behind the stream feature as well to avoid the regression. If the regression is confirmed but the culprit is not the range coder, producing a flamegraph might be useful as well to see where the regression may come from.

I hope to get back with a more complete reply in the next few days.

@GreenReaper commented:

@gendx: Do you have time to revisit this? ruffle-rs/ruffle#405 (comment) indicates a need for streaming support prior to inclusion. Ruffle is getting more attention due to next month's demise of the Flash browser plugin. There isn't that much lzma-encoded Flash (gzip remained the default), but it exists, and it would be nice for Ruffle to handle it, as well as helping the projects mentioned in #10.

@cccs-sadugas (Contributor, Author) commented:

> I'll re-review the changes, and perform benchmarks locally to confirm the performance impact.
>
> From your results, it seems that there could be a slight performance regression even in non-streaming mode, maybe due to the changes in the range coder (update boolean parameter), in which case it could be worth putting that behind the stream feature as well to avoid the regression. If the regression is confirmed but the culprit is not the range coder, producing a flamegraph might be useful as well to see where the regression may come from.
>
> I hope to get back with a more complete reply in the next few days.

@gendx any updates on this pull request?

I agree that the slight performance regression may be caused by the changes in the process function and the update flag, since these create additional code branches.

The first version of my pull request avoided this issue by writing two versions of the process and range coder functions, at the cost of code duplication, e.g. parse_reverse_bit_tree with &mut self and parse_reverse_bit_tree_check with &self.

I ran the benchmarks once again today. They show a larger 8-12% difference for the smaller tests and a lower 3% difference for the big-file test; larger inputs may fare relatively better because the initial setup cost is amortized.

# master compared with this branch (default features)
 name                             master ns/iter  branch-default ns/iter  diff ns/iter  diff %  speedup 
 compress_65536                   1,889,807       1,806,011                    -83,796  -4.43%   x 1.05 
 compress_empty                   1,305           1,259                            -46  -3.52%   x 1.04 
 compress_hello                   1,707           1,710                              3   0.18%   x 1.00 
 decompress_after_compress_65536  1,695,077       1,896,378                    201,301  11.88%   x 0.89 
 decompress_after_compress_empty  3,010           3,074                             64   2.13%   x 0.98 
 decompress_after_compress_hello  3,540           3,817                            277   7.82%   x 0.93 
 decompress_big_file              5,950,045       6,104,193                    154,148   2.59%   x 0.97 
 decompress_huge_dict             3,670           3,672                              2   0.05%   x 1.00 
# this branch (default features) compared with all features
 name                             branch-default ns/iter  branch-all-features ns/iter  diff ns/iter  diff %  speedup 
 compress_65536                   1,806,011               1,846,704                          40,693   2.25%   x 0.98 
 compress_empty                   1,259                   1,309                                  50   3.97%   x 0.96 
 compress_hello                   1,710                   1,703                                  -7  -0.41%   x 1.00 
 decompress_after_compress_65536  1,896,378               1,874,422                         -21,956  -1.16%   x 1.01 
 decompress_after_compress_empty  3,074                   3,023                                 -51  -1.66%   x 1.02 
 decompress_after_compress_hello  3,817                   3,560                                -257  -6.73%   x 1.07 
 decompress_big_file              6,104,193               6,149,868                          45,675   0.75%   x 0.99 
 decompress_huge_dict             3,672                   3,647                                 -25  -0.68%   x 1.01 
WARNING: benchmarks in new but not in old: decompress_stream_big_file
# additional bench in this branch with all features
test decompress_stream_big_file      ... bench:   5,936,301 ns/iter (+/- 484,516)

I generated flamegraphs and attached them to this thread. N is the number of test iterations.

+-----------------+---------+----------------------------------+------+
|      name       | branch  |               test               |  N   |
+-----------------+---------+----------------------------------+------+
| master.svg      | master  | decompress_big_file foo.txt.lzma | 1000 |
| master-dac.svg  | master  | decompress_after_compress_65536  |  100 |
| control.svg     | current | decompress_big_file foo.txt.lzma | 1000 |
| control-dac.svg | current | decompress_after_compress_65536  |  100 |
+-----------------+---------+----------------------------------+------+

Flamegraph analysis:

  • diff.svg shows the difference between master and control: +3% overall, mainly in read.
  • diff-dac.svg shows the difference between master-dac and control-dac, with +40% more time spent in decode_bit.

I tried some micro-optimizations by moving branches around, but overall this did not yield a significant performance improvement. On the positive side, the streaming decompressor is more CPU- and memory-efficient :).

flamegraph.zip

@gendx (Owner) left a review:

Sorry again for the delay in reviewing this.

Overall this should be good to go as an experimental streaming feature. I left some minor comments for reference, which can be fixed separately after merging.

Regarding performance, the minor regression is not that large, at least when the feature is off. But further PRs are welcome to improve performance if needed.

However, I'd like to experiment with an async implementation in the future, e.g. over futures::io::AsyncRead and futures::io::AsyncWrite, which should provide a more seamless way of streaming without having to implement the state machine for buffering within lzma-rs, and in particular without having to have this update boolean passed throughout the implementation.

Other improvements could be to make RangeDecoder a trait and integrate the update bit into it (i.e. having an UpdatingRangeDecoder and an ObservingRangeDecoder that implement the RangeDecoder trait), or to leverage const generics to store the update bit as a const-generic boolean.
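
A rough sketch of the trait direction described above; the names and method shape are illustrative, not the crate's actual API:

```rust
use std::io;

// Capture the update-vs-observe distinction in the type instead of threading
// a boolean flag through every call.
trait RangeDecode {
    /// Decode one bit against the given probability slot.
    fn decode_bit(&mut self, prob: &mut u16) -> io::Result<bool>;
}

struct UpdatingRangeDecoder { /* range, code, input reader, ... */ }
struct ObservingRangeDecoder { /* same state, but read-only probabilities */ }

impl RangeDecode for UpdatingRangeDecoder {
    fn decode_bit(&mut self, prob: &mut u16) -> io::Result<bool> {
        // would renormalize the range and update `prob`
        unimplemented!()
    }
}

impl RangeDecode for ObservingRangeDecoder {
    fn decode_bit(&mut self, _prob: &mut u16) -> io::Result<bool> {
        // same arithmetic, but leaves the probability model untouched
        unimplemented!()
    }
}
```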

use std::io::Write;

/// Utility function to read a file into memory
fn read_to_file(filename: &str) -> std::io::Result<Vec<u8>> {

gendx (Owner) commented:

This would be better named read_from_file or read_all_file.

let n = compfile.read(&mut tmp).unwrap();
stream.write_all(&tmp[0..n]).unwrap();

if n == 0 {

gendx (Owner) commented:

The break should come right after the read() call.
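
A sketch of the suggested reordering, shown here as a standalone helper (names follow the test snippet above):

```rust
use std::io::{Read, Write};

// Stop as soon as read() reports end of input, before attempting a write.
fn copy_in_chunks(mut compfile: impl Read, stream: &mut impl Write) -> std::io::Result<()> {
    let mut tmp = [0u8; 1024];
    loop {
        let n = compfile.read(&mut tmp)?;
        if n == 0 {
            break;
        }
        stream.write_all(&tmp[..n])?;
    }
    Ok(())
}
```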

}
}

fn assert_decomp_eq(input: &[u8], expected: &[u8]) {

gendx (Owner) commented:

There's some code duplication with the previous function.

/// Tells the decompressor if we should expect more data after parsing the
/// current input.
#[derive(Debug, PartialEq)]
pub enum ProcessingMode {

gendx (Owner) commented:

This shouldn't be pub.

let props = input
    .read_u8()
    .map_err(|e| error::Error::LZMAError(format!("LZMA header too short: {}", e)))?;
let props = input.read_u8().map_err(error::Error::HeaderTooShort)?;

gendx (Owner) commented:

Does it work when streaming in chunks of 1 byte?

let mut stream = lzma_rs::decompress::Stream::new(Vec::new());

// read file in chunks
let mut tmp = [0u8; 1024];

gendx (Owner) commented:

This should test chunks of 1 byte as well.
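
A hedged sketch of what a 1-byte-chunk variant could look like (the helper name is hypothetical, and finish() is assumed to hand back the output sink as in the existing test):

```rust
use std::io::Write;

// Feed the compressed data one byte at a time to exercise the
// partial-input buffering paths of the streaming decoder.
fn decompress_one_byte_at_a_time(compressed: &[u8]) -> Vec<u8> {
    let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
    for byte in compressed {
        stream.write_all(std::slice::from_ref(byte)).unwrap();
    }
    stream.finish().unwrap()
}
```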

&& (self.partial_input_buf.position() as usize) < MAX_REQUIRED_INPUT
&& self
    .try_process_next(
        &tmp[0..self.partial_input_buf.position() as usize],

gendx (Owner) commented:

Unneeded zero at the beginning of the range. Same below.

// Fill as much of the tmp buffer as possible
let start = self.partial_input_buf.position() as usize;
let bytes_read =
    rangecoder.read_into(&mut self.partial_input_buf.get_mut()[start..])? as u64;

gendx (Owner) commented:

Looks good to me.

Ok(())
}

pub fn process_mode<'a, R: io::BufRead>(

gendx (Owner) commented:

This shouldn't be pub.

@gendx merged commit e18d466 into gendx:master on Nov 30, 2020

@cccs-sadugas (Contributor, Author) commented:

> Sorry again for the delay in reviewing this.
>
> Overall this should be good to go as an experimental streaming feature. I left some minor comments for reference, which can be fixed separately after merging.
>
> Regarding performance, the minor regression is not that large, at least when the feature is off. But further PRs are welcome to improve performance if needed.
>
> However, I'd like to experiment with an async implementation in the future, e.g. over futures::io::AsyncRead and futures::io::AsyncWrite, which should provide a more seamless way of streaming without having to implement the state machine for buffering within lzma-rs, and in particular without having to have this update boolean passed throughout the implementation.
>
> Other improvements could be to make RangeDecoder a trait and integrate the update bit into it (i.e. having an UpdatingRangeDecoder and an ObservingRangeDecoder that implement the RangeDecoder trait), or to leverage const generics to store the update bit as a const-generic boolean.

Thanks for the review and merge :). I will look at the comments in the near future. Is there anything else required to publish the added features to crates.io?

I agree that the async traits could be useful and are worth looking into. Some projects using this crate may not have an async runtime, or may depend on the std::io::Write interface. It may be worth looking for a way to wrap the async interface behind std::io::Write for those users.
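
A rough sketch of the kind of wrapper hinted at here, assuming the futures crate's blocking executor; this is entirely hypothetical and not part of this PR:

```rust
use futures::executor::block_on;
use futures::io::{AsyncWrite, AsyncWriteExt};
use std::io;

// Hypothetical adapter: drives an AsyncWrite with a blocking executor so it
// can be handed to code that only understands std::io::Write.
struct BlockingWriter<W>(W);

impl<W: AsyncWrite + Unpin> io::Write for BlockingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        block_on(self.0.write(buf))
    }

    fn flush(&mut self) -> io::Result<()> {
        block_on(self.0.flush())
    }
}
```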

@gendx (Owner) commented Dec 3, 2020

> Thanks for the review and merge :). I will look at the comments in the near future. Is there anything else required to publish the added features to crates.io?

Thanks a lot for your patience as well, and sorry again for the reviewing delay; this was a big change and the last few months have been quite busy on my end.
I'll give it a few more days before publishing a new version. In particular, it would be good to look at how this PR affects #61 before publishing anything.

> I agree that the async traits could be useful and are worth looking into. Some projects using this crate may not have an async runtime, or may depend on the std::io::Write interface. It may be worth looking for a way to wrap the async interface behind std::io::Write for those users.

The plan would be to have 3 implementations:

  1. The main, blocking one.
  2. The experimental streaming interface introduced here.
  3. Another experimental async interface.

From a maintenance point of view, keeping only one of (2) and (3) would be preferable in the long term (depending on actual adoption), but on the other hand it's indeed not always possible or easy to have users pass an AsyncWrite object when that requires providing a fully-fledged async runtime (e.g. in embedded systems? in wasm?).

gendx added a commit that referenced this pull request Dec 12, 2020
@gendx mentioned this pull request on Dec 12, 2020

@gendx (Owner) commented Dec 12, 2020

> Thanks for the review and merge :). I will look at the comments in the near future. Is there anything else required to publish the added features to crates.io?

For now, the streaming implementation doesn't work for some combination(s) of the test cases and the streaming chunk size, as evidenced by #63, so fixing that is still required before publishing a new release on crates.io.

gendx added a commit that referenced this pull request Jan 28, 2021
gendx added a commit that referenced this pull request Jan 30, 2021