
Streaming Decompressor #51

Closed

Conversation

Contributor

@cccs-sadugas cccs-sadugas commented Jul 13, 2020

Pull Request Overview

This pull request implements issue #10.

It currently depends on #50.

Testing Strategy

This pull request was tested by:

  • Adding relevant unit tests.
  • Adding relevant end-to-end tests (on .lzma, .lzma2, and .xz files).

Supporting Documentation and References

This implementation is based on the LzmaDec_TryDummy function in libhtp's port of the LZMA SDK.

TODO

This could be hidden behind a feature flag upon request.

Benchmarks

I added a benchmark, decompress_stream_big_file, which shows that streaming decompression is actually faster. This could be related to better memory management.

CPU: Intel(R) Core(TM) i7-1065G7 @ 1.30 GHz
RAM: 16 GB @ 4267 MHz
OS: Ubuntu 20.04 LTS (Docker on Windows)

master

test compress_65536                  ... bench:   2,689,440 ns/iter (+/- 667,102)
test compress_empty                  ... bench:       1,629 ns/iter (+/- 346)
test compress_hello                  ... bench:       2,278 ns/iter (+/- 430)
test decompress_after_compress_65536 ... bench:   3,655,315 ns/iter (+/- 6,067,710)
test decompress_after_compress_empty ... bench:       3,977 ns/iter (+/- 1,797)
test decompress_after_compress_hello ... bench:       4,633 ns/iter (+/- 1,811)
test decompress_big_file             ... bench:   7,079,015 ns/iter (+/- 2,228,350)
test decompress_huge_dict            ... bench:       5,359 ns/iter (+/- 4,065)

streaming-decompressor

test compress_65536                  ... bench:   2,746,007 ns/iter (+/- 1,179,775)
test compress_empty                  ... bench:       1,618 ns/iter (+/- 1,428)
test compress_hello                  ... bench:       3,729 ns/iter (+/- 4,189)
test decompress_after_compress_65536 ... bench:   3,134,650 ns/iter (+/- 2,263,524)
test decompress_after_compress_empty ... bench:       3,747 ns/iter (+/- 1,575)
test decompress_after_compress_hello ... bench:       4,558 ns/iter (+/- 1,288)
test decompress_big_file             ... bench:   7,212,875 ns/iter (+/- 2,376,085)
test decompress_huge_dict            ... bench:       4,633 ns/iter (+/- 7,197)
test decompress_stream_big_file      ... bench:   6,670,270 ns/iter (+/- 2,605,148)

Adds a memlimit configuration option for decompression. If the dict buffer's memory limit is exceeded, decompression will fail with an LZMAError. Additional functions were added to reduce the amount of breaking changes in the library.

Moves the memlimit check so that it only occurs when resizing the buffer.

Changes the `LZBuffer` trait to consume the output sink instead of holding a reference to it. This makes it easier to store the sink and avoids self-referential structs. It also makes sense for the buffer to own the sink while it is performing decompression. This also adds the methods `get_ref` and `get_mut` to access the output sink.

Adds a streaming mode so processing can work with streaming chunks of data. This is required because process() assumed the input reader contained a complete stream. A CheckState, check() methods, and try_process_next() were added to handle the case where the decompressor requests more input bytes than are available. Data is temporarily buffered in the DecoderState if more input bytes are required to make progress. This commit also adds utility functions to the rangecoder for working with streaming data.

Creates a new struct `Stream` that uses the `std::io::Write` interface to read chunks of compressed data and write them to an output sink.

Adds an option to disable end-of-stream checks when calling `finish()` on a stream, since some users may want to retrieve partially decompressed data.
@cccs-sadugas changed the title from "Streaming ecompressor" to "Streaming Decompressor" on Jul 13, 2020
Owner

@gendx gendx left a comment

  • Can you rebase on HEAD now that memlimit is merged?
  • Can you add unit tests that the decompressed output is the same via streaming vs. with the direct interface?
  • Can you make sure Travis-CI passes?

Comment on lines +20 to +21
#[cfg(feature = "enable_logging")]
info!("Compressed {} -> {} bytes", x.len(), compressed.len());
Owner

You can use lzma_info! and similar macros instead of info! here and elsewhere. This handles the cfg(feature = "enable_logging") under the hood.
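Such a feature-gated macro can be sketched as follows (an illustration of the pattern only; lzma-rs's actual `lzma_info!` may be defined differently):

```rust
// Sketch: the macro hides the cfg gate so call sites stay clean.
// With the feature disabled, the call expands to nothing.
macro_rules! lzma_info {
    ($($arg:tt)*) => {
        #[cfg(feature = "enable_logging")]
        log::info!($($arg)*);
    };
}

fn main() {
    // No #[cfg(feature = "enable_logging")] needed at the call site.
    lzma_info!("Compressed {} -> {} bytes", 100, 42);
    println!("ok");
}
```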

Contributor Author

I'm not sure those macros can be imported from the tests folder. This code was copied from tests/lzma.rs.

Comment on lines +17 to +20
// Get a reference to the output sink
fn get_ref(&self) -> &W;
// Get a mutable reference to the output sink
fn get_mut(&mut self) -> &mut W;
Owner

Can you provide meaningful names, such as get_output or get_writer or get_sink? Then get_foo_mut.

Contributor Author

I went with get_output and get_output_mut.

{
stream: &'a mut W, // Output sink
Owner

These changes to the interface taking a Write rather than a &'a mut Write make sense. However, they seem decoupled from this pull request, and would be worth submitting as a separate pull request to study their effect on the benchmarks.

Contributor Author

Good idea. See PR #54.

{
_phantom: std::marker::PhantomData<W>,
Owner

A file-level `use std::marker::PhantomData` would be clearer. It will also make it easier to support #43.
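For illustration, the suggestion amounts to this (a sketch, not the PR's code; the struct and field names are placeholders):

```rust
use std::marker::PhantomData;

// With a file-level import, the field declaration shrinks from
// `std::marker::PhantomData<W>` to `PhantomData<W>`.
struct Decoder<W> {
    // W appears only in method signatures, so PhantomData records it.
    _phantom: PhantomData<W>,
}

impl<W> Decoder<W> {
    fn new() -> Self {
        Decoder { _phantom: PhantomData }
    }
}

fn main() {
    let _d: Decoder<Vec<u8>> = Decoder::new();
    // PhantomData is zero-sized, so the marker costs nothing at runtime.
    println!("size = {}", std::mem::size_of::<Decoder<Vec<u8>>>());
}
```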

@@ -92,6 +134,23 @@ where
}
}

#[inline]
pub fn decode_bit_check(&mut self, prob: u16) -> io::Result<bool> {
Owner

The original decode_bit updates the prob value. I don't understand how prob is updated in this new function.

Contributor Author

The point of this function is to avoid updating prob (i.e. it's not declared as &mut prob). Otherwise, it behaves much the same as the non-_check version.

Why not update prob while checking?

Updating probs could cause a scenario where the state is non-recoverable. I think the updated prob value is only meant to be used in the next iteration of the loop anyway, so it should be fine for checking this iteration.

See reference implementation LzmaDec_TryDummy.
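The distinction can be sketched with a simplified range decoder (not lzma_rs's actual rangecoder, and it omits normalization; the arithmetic follows the standard LZMA bit-decode step):

```rust
// `decode_bit` adapts the probability model; the `_check` variant runs
// the same comparison on copies, so the decoder state stays
// recoverable if more input turns out to be needed.
struct RangeDecoder {
    range: u32,
    code: u32,
}

impl RangeDecoder {
    // Normal decode: updates range/code and adapts `prob`.
    fn decode_bit(&mut self, prob: &mut u16) -> bool {
        let bound = (self.range >> 11) * (*prob as u32);
        if self.code < bound {
            self.range = bound;
            *prob += (2048 - *prob) >> 5; // adapt toward bit 0
            false
        } else {
            self.range -= bound;
            self.code -= bound;
            *prob -= *prob >> 5; // adapt toward bit 1
            true
        }
    }

    // Dry-run decode: same comparison, nothing mutated, so the caller
    // can bail out and retry once more input arrives.
    fn decode_bit_check(&self, prob: u16) -> bool {
        let bound = (self.range >> 11) * (prob as u32);
        self.code >= bound
    }
}

fn main() {
    let mut dec = RangeDecoder { range: 1 << 24, code: 0 };
    let mut prob: u16 = 1024;
    let checked = dec.decode_bit_check(prob); // state untouched
    let decoded = dec.decode_bit(&mut prob);  // state updated
    assert_eq!(checked, decoded);
    println!("bit = {}", decoded);
}
```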

Comment on lines +125 to +128
Err(std::io::Error::new(
std::io::ErrorKind::Other,
"failed to read header",
))
Owner

This should be an LZMAError.

Contributor

I had to implement Display for lzma_rs::error::Error. I put this in a separate PR #53.


let len = match input.fill_buf() {
Ok(val) => val,
Err(_) => {
Owner

The original I/O error shouldn't be lost. It can provide useful information to the user (network error, invalid permissions to read a file, etc.).
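The fix being asked for amounts to propagating the error instead of replacing it, e.g. with `?` (a sketch; `read_some` is a hypothetical helper, not the PR's code):

```rust
use std::io::{self, BufRead, Cursor};

// Propagate the original I/O error so callers still see the underlying
// cause (permissions, network failure, etc.).
fn read_some<R: BufRead>(input: &mut R) -> io::Result<usize> {
    let buf = input.fill_buf()?; // `?` preserves the original error
    let len = buf.len();
    input.consume(len);
    Ok(len)
}

fn main() {
    let mut input = Cursor::new(b"abc".to_vec());
    let n = read_some(&mut input).unwrap();
    assert_eq!(n, 3);
    println!("read {} bytes", n);
}
```

If the crate's own error type is needed instead, the `io::Error` can be wrapped rather than discarded, keeping it as the source.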

output: W,
mut input: &mut R,
options: &Options,
) -> Result<State<W>, (Option<State<W>>, std::io::Error)> {
Owner

(Option<State<W>>, std::io::Error) looks weird for an error type. Can you extend LZMAError for this use case?

Contributor Author

Unfortunately, that would require adding a type parameter to LZMAError, i.e. LZMAError<W>. I think a less intrusive way would be to create an error type inside stream.rs and return that:

Result<State<W>, StreamStateError<W>>
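Such an error type might look like this (a hypothetical sketch of the proposal; names and shape are illustrative, not the PR's code):

```rust
use std::io;

// Pairs the I/O error with whatever decoder state survived, so callers
// can still recover partially decompressed data.
struct State<W> {
    output: W,
}

enum StreamStateError<W> {
    Io(Option<State<W>>, io::Error),
}

// Stand-in for a decode step that fails partway through.
fn run_to_error() -> Result<State<Vec<u8>>, StreamStateError<Vec<u8>>> {
    let partial = State { output: b"partial".to_vec() };
    Err(StreamStateError::Io(
        Some(partial),
        io::Error::new(io::ErrorKind::UnexpectedEof, "more input needed"),
    ))
}

fn main() {
    match run_to_error() {
        Err(StreamStateError::Io(Some(state), err)) => {
            // The partially written sink is still accessible.
            println!("recovered {} bytes ({})", state.output.len(), err);
        }
        _ => unreachable!(),
    }
}
```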

/// Test processing all chunk sizes
#[test]
fn test_stream_chunked() {
let small_input = b"Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll";
Owner

This looks like a useful input to have as a test file, so that other unit tests can use it as well.

Contributor Author

Added a test file and included it with `include_bytes!("../../tests/files/small.txt")`, given that it's a small input.

b.iter(|| {
let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
stream.write_all(compressed).unwrap();
stream.finish().unwrap();
Owner

I believe the resulting decompressed output should be the return value of the lambda function passed to Bencher::iter, to make sure it's not optimized away. Other benchmarks don't currently do that either, but I can send a pull request to fix them accordingly.

Contributor Author

Good catch.

@cccs-sadugas
Contributor Author

Thanks :). I will address your comments and open a new PR.
