Streaming Decompressor #51
Conversation
Adds a memlimit configuration option for decompression. If the dict buffer's memory limit is exceeded, decompression will fail with an LZMAError. Additional functions were added to reduce the number of breaking changes in the library.
Move the memlimit check so that it only occurs when we are resizing the buffer.
Changes `LZBuffer` trait to consume the output sink instead of holding a reference to it. This makes it easier to store the sink and avoids self-referential structs. It also makes sense for the buffer to own the sink while it is performing decompression. This also adds the methods `get_ref` and `get_mut` to access the output sink.
Add a streaming mode so processing can work with streaming chunks of data. This is required because process() assumed the input reader contained a complete stream. A CheckState struct, check() methods, and try_process_next() were added to handle the case where the decompressor requests more input bytes than are available. Data is temporarily buffered in the DecoderState if more input bytes are required to make progress. This commit also adds utility functions to the rangecoder for working with streaming data.
Creates a new struct `Stream` that uses the `std::io::Write` interface to read chunks of compressed data and write them to an output sink.
Adds an option to disable end-of-stream checks when calling `finish()` on a stream, since some users may want to retrieve partially decompressed data.
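To make the intended workflow concrete, here is a minimal usage sketch of the streaming interface (not code from this PR; it assumes the Stream::new / write_all / finish calls shown in the benchmark later in this conversation, with error handling reduced to unwrap()):

use std::io::Write;

fn decompress_chunked(chunks: &[&[u8]]) -> Vec<u8> {
    // The output sink (here a Vec<u8>) is owned by the stream while decoding.
    let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
    for chunk in chunks {
        // Feed compressed bytes as they arrive; the decoder buffers input
        // internally when it needs more data to make progress.
        stream.write_all(chunk).unwrap();
    }
    // finish() runs the end-of-stream checks and returns the output sink.
    stream.finish().unwrap()
}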
- Can you rebase on HEAD now that memlimit is merged?
- Can you add unit tests that the decompressed output is the same via streaming vs. with the direct interface?
- Can you make sure Travis-CI passes?
#[cfg(feature = "enable_logging")] | ||
info!("Compressed {} -> {} bytes", x.len(), compressed.len()); |
You can use lzma_info! and similar macros instead of info! here and elsewhere. This handles the cfg(feature = "enable_logging") under the hood.
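As a rough illustration only (the crate's actual macro may be defined differently), such a wrapper can hide the feature gate inside the macro so call sites stay clean:

macro_rules! lzma_info {
    ($($arg:tt)+) => {
        // The cfg gate lives inside the macro, so callers don't repeat it.
        #[cfg(feature = "enable_logging")]
        info!($($arg)+);
    };
}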
I'm not sure those macros can be imported from the tests folder. This code was copied from tests/lzma.rs.
// Get a reference to the output sink
fn get_ref(&self) -> &W;
// Get a mutable reference to the output sink
fn get_mut(&mut self) -> &mut W;
Can you provide meaningful names, such as get_output, get_writer, or get_sink? Then get_foo_mut.
I went with get_output and get_output_mut.
{
    stream: &'a mut W, // Output sink
These changes to the interface taking a Write rather than a &'a mut Write make sense. However, they seem decoupled from this pull request, and would be worth submitting as a separate pull request to study their effect on the benchmarks.
Good idea. See PR #54.
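As a side note on this interface change (general Rust behavior, not code from this PR): because the standard library implements Write for &mut W, an API that takes W: Write by value can still be handed a mutable reference, so callers keep ownership of their sink:

use std::io::Write;

fn consume_writer<W: Write>(mut w: W) -> std::io::Result<()> {
    w.write_all(b"hello")
}

fn main() -> std::io::Result<()> {
    let mut buf = Vec::new();
    // Passing &mut buf satisfies W: Write, so `buf` stays usable afterwards.
    consume_writer(&mut buf)?;
    assert_eq!(buf, b"hello");
    Ok(())
}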
{
    _phantom: std::marker::PhantomData<W>,
A file-level `use std::marker::PhantomData` would be clearer. It will also make it easier to support #43.
@@ -92,6 +134,23 @@ where
    }
}

#[inline]
pub fn decode_bit_check(&mut self, prob: u16) -> io::Result<bool> {
The original decode_bit updates the prob value. I don't understand how prob is updated in this new function.
The point of this function is to avoid updating prob (i.e. it is not declared as &mut prob). Otherwise, it behaves the same as the non-_check version.
Why not update prob while checking?
Updating probs could cause a scenario where the state is non-recoverable. I think the updated prob value is only meant to be used in the next iteration of the loop anyway, so it should be fine for checking this iteration.
See the reference implementation, LzmaDec_TryDummy.
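To illustrate the distinction being discussed, here is a simplified sketch (not the crate's actual code) contrasting an adaptive decode with a non-adapting check variant, loosely following LZMA's probability-update rules; the real decode_bit_check in this PR additionally handles normalization and input availability, which this sketch omits:

struct RangeDecoder {
    range: u32,
    code: u32,
}

impl RangeDecoder {
    // Normal decode: adapts the probability model toward the decoded bit.
    fn decode_bit(&mut self, prob: &mut u16) -> bool {
        let bound = (self.range >> 11) * u32::from(*prob);
        if self.code < bound {
            *prob += (2048 - *prob) >> 5; // bit 0: move probability up
            self.range = bound;
            false
        } else {
            *prob -= *prob >> 5; // bit 1: move probability down
            self.code -= bound;
            self.range -= bound;
            true
        }
    }

    // Dry-run decode: same arithmetic, but `prob` is taken by value and never
    // written back, so the adaptive model stays untouched. In a full dry run
    // the range/code state would also be restored if the check fails.
    fn decode_bit_check(&mut self, prob: u16) -> bool {
        let bound = (self.range >> 11) * u32::from(prob);
        if self.code < bound {
            self.range = bound;
            false
        } else {
            self.code -= bound;
            self.range -= bound;
            true
        }
    }
}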
Err(std::io::Error::new(
    std::io::ErrorKind::Other,
    "failed to read header",
))
This should be an LZMAError.
I had to implement Display for lzma_rs::error::Error. I put this in a separate PR #53.
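For reference, a sketch of what the reviewer's suggestion could look like, assuming the crate's error enum exposes an LZMAError(String) variant as used for other decoder failures (the helper name here is hypothetical):

// Hypothetical helper: report a bad header through the crate's own error
// type instead of a generic std::io::Error.
fn header_error() -> lzma_rs::error::Error {
    lzma_rs::error::Error::LZMAError("failed to read header".to_string())
}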
let len = match input.fill_buf() {
    Ok(val) => val,
    Err(_) => {
The original I/O error shouldn't be lost. It can provide useful information to the user (network error, invalid permissions to read a file, etc.).
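A small sketch of what preserving the source error could look like (illustrative only; the function name and return type are assumptions, not code from this PR):

use std::io::BufRead;

fn peek_available<R: BufRead>(input: &mut R) -> std::io::Result<usize> {
    let buf = match input.fill_buf() {
        Ok(val) => val,
        // Propagate the original io::Error (network failure, permission
        // denied, ...) instead of discarding it behind a generic message.
        Err(e) => return Err(e),
    };
    Ok(buf.len())
}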
    output: W,
    mut input: &mut R,
    options: &Options,
) -> Result<State<W>, (Option<State<W>>, std::io::Error)> {
(Option<State<W>>, std::io::Error) looks weird for an error type. Can you extend LZMAError for this use case?
Unfortunately that would require adding a type parameter to LZMAError, i.e. LZMAError<W>. I think a less intrusive way would be to create an error type inside of stream.rs and return that: Result<State<W>, StreamStateError<W>>.
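A rough sketch of the shape such an error type could take (names are assumptions from the discussion above, with the state made generic so the snippet stands alone):

// Hypothetical error type for the streaming decoder: it carries the decoder
// state at the point of failure, when still usable, together with the
// underlying I/O error, so a caller can inspect or resume decompression.
pub struct StreamStateError<S> {
    pub state: Option<S>,
    pub error: std::io::Error,
}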
/// Test processing all chunk sizes
#[test]
fn test_stream_chunked() {
    let small_input = b"Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll";
This looks like a useful input to have as a test file, so that other unit tests can use it as well.
Added a test file and included it with include_bytes!("../../tests/files/small.txt"), given that it's a small input.
b.iter(|| {
    let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
    stream.write_all(compressed).unwrap();
    stream.finish().unwrap();
I believe the resulting decompressed output should be returned from the closure passed to Bencher::iter, to make sure it's not optimized away. Other benchmarks don't currently do that either, but I can send a pull request to fix them accordingly.
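A sketch of the reshaped benchmark body (assumes the nightly test crate's Bencher and that `compressed` holds the LZMA-compressed input, as in the snippet above):

use std::io::Write;

fn bench_stream(b: &mut test::Bencher, compressed: &[u8]) {
    b.iter(|| {
        let mut stream = lzma_rs::decompress::Stream::new(Vec::new());
        stream.write_all(compressed).unwrap();
        // Returning the decompressed Vec<u8> from the closure keeps the
        // result observable, so the work cannot be optimized away.
        stream.finish().unwrap()
    });
}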
Good catch.
Thanks :). I will address your comments and open a new PR.
Pull Request Overview
This pull request implements issue #10.
It is currently dependent on #50.
Testing Strategy
This pull request was tested by... (.lzma, .lzma2, .xz files).
Supporting Documentation and References
This implementation is based on the LzmaDec_TryDummy function in libhtp's port of the LZMA SDK.
TODO
Could hide this under a feature flag upon request.
Benchmarks
I added a benchmark decompress_stream_big_file that shows that streaming decompression is actually faster. This could be related to better memory management.
CPU: Intel(R) Core(TM) i7-1065G7 @ 1.30 GHz
RAM: 16 GB @ 4267 MHz
OS: Docker Windows Ubuntu 20.04 LTS
Results compare the master and streaming-decompressor branches.