Do not raise error if code equals range in get_bit #16

dragly · 2019-12-09T14:43:24Z

It appears that code being equal to range is a valid state and that some files can have
data encoded like this. An example of such a file has been added to the
tests.

Pull Request Overview

This pull request fixes #15

Testing Strategy

This pull request was tested by...

- [ ] Added relevant unit tests.
- [x] Added relevant end-to-end tests (such as `.lzma`, `.lzma2`, `.xz` files).

Supporting Documentation and References

The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.

TODO or Help Wanted

None

gendx · 2019-12-09T16:15:00Z

Thanks for your contribution!

One thing I wanted to do but didn't have the time yet was to have tests comparing the implementation to established libraries such as https://crates.io/crates/lzma-sys or https://crates.io/crates/xz2 (backed by a reference C implementation). That is, the decomp_big_file test should also check that we get the same result as the system library.

I don't know how hard it would be to decompress a raw LZMA stream with these crates, as opposed to an xz file. If you manage to implement something like that, it would be very helpful in testing the correctness of such changes!

Otherwise, I'll get back to the details when I have more time. But if you have a link to another LZMA implementation where the code == range case is handled as you describe that would also speed things up :)

dragly · 2019-12-09T16:56:27Z

Ah, yes. Sorry, I intended to link to the relevant code in XZ Utils, but forgot to do so.

I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/rangecoder/range_decoder.h#L175

It is used from here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/lzma/lzma_decoder.c#L625

As you can see, there is no test here, but a few additional operations after subtracting the range from the code. I am not sure what they correspond to in lzma_rs.
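
For readers comparing the two styles, here is a minimal sketch of a single direct-bit decode step. It is an illustration only: the function names are hypothetical, normalization (refilling `code` from the input stream) is omitted, and neither function is copied from lzma_rs or XZ Utils. The first form branches and accepts `code == range` as bit 1. The second subtracts first and then uses a mask to undo the subtraction, which is one reading of the "additional operations" mentioned above.

```rust
/// Branching form: after halving `range`, any `code >= range` decodes to 1.
/// In particular `code == range` is accepted and leaves `code` at 0.
fn direct_bit_branching(range: &mut u32, code: &mut u32) -> u32 {
    *range >>= 1;
    if *code >= *range {
        *code -= *range;
        1
    } else {
        0
    }
}

/// Branchless form: subtract unconditionally, then restore `code`
/// if the subtraction wrapped around (that is, if `code < range`).
fn direct_bit_branchless(range: &mut u32, code: &mut u32) -> u32 {
    *range >>= 1;
    *code = code.wrapping_sub(*range);
    // All ones if the top bit is set (the subtraction wrapped), else zero.
    let mask = 0u32.wrapping_sub(*code >> 31);
    *code = code.wrapping_add(*range & mask);
    // mask + 1 is 0 when the subtraction wrapped, 1 otherwise.
    mask.wrapping_add(1)
}
```

Both forms assume the usual range-coder invariant that `code` is smaller than `range` before the step. Under that assumption they return the same bit and leave the same state, including when `code` equals the halved `range`.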

I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well. We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? 😅 There appear to be quite a few resources out there on LZMA, but the details seem to vary a lot between them. It was just interesting to see how different your implementation was from that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)

gendx · 2019-12-09T23:41:03Z

> I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:
>
> https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/rangecoder/range_decoder.h#L175

Thanks, this file will be helpful in double-checking how the range coder works in detail :)

> I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well.

On this topic, turns out that I already wrote a dumb xz encoder https://github.com/gendx/lzma-rs/blob/master/src/encode/xz.rs, that currently packages an LZMA2 stream which itself encapsulates raw uncompressed data. But it shouldn't be too hard to turn that into a packaging tool that takes an already encoded LZMA stream and packages it as an xz archive, which would allow easier comparison with other libraries.

> We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.
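
One possible way to remove that overhead, not necessarily the one intended here, is to gate the logging statements at compile time so that they disappear from builds that do not opt in. The sketch below is an assumption throughout: the macro name `trace_step!` and the Cargo feature name `enable_logging` are hypothetical, and `eprintln!` stands in for whatever logging macro the crate actually uses.

```rust
// Hypothetical macro: expands to a logging statement only when the
// (hypothetical) `enable_logging` feature is enabled at compile time.
// Without the feature the statement is removed before code generation.
macro_rules! trace_step {
    ($($arg:tt)*) => {
        #[cfg(feature = "enable_logging")]
        eprintln!($($arg)*);
    };
}

/// Toy stand-in for a hot decoder step that wants to log its state.
fn halve_range(range: u32) -> u32 {
    let halved = range >> 1;
    trace_step!("range {:#x} -> {:#x}", range, halved);
    halved
}
```

The trade-off is that logging can no longer be switched on through an environment variable alone: a build with the feature enabled is needed first.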

> By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? 😅 There appear to be quite a few resources out there on LZMA, but the details seem to vary a lot between them. It was just interesting to see how different your implementation was from that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)

I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.

Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?

I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)

gendx · 2019-12-10T00:02:44Z

tests/files/README.md

+This is a file that causes the code and range to be equal at some point during decoding LZMA data.
+Previously, this file would raise an `LZMAError("Corrupted range coding")`, 
+although the file is a valid LZMA file.


Could you document how you generated this file? From the looks of it, it's not really "random" in the sense of `head -c N /dev/urandom`. Or is the file more structured than that (you mentioned the OpenCTM format, could you add a link to what OpenCTM is)?

My concerns are that:

- The file is quite large, which will impact the size of the repository.
- Given the file size, I also want to make sure that the license for the file's contents is compatible with the repository's license (MIT), and that it doesn't contain any sensitive/personal/private data.

Otherwise I'm wondering how many small random files would be necessary to reproduce the code == range case. Or whether the dumb encoder could be used to generate this case with a much smaller file (e.g. with an all-zeros or all-ones file).

dragly · 2019-12-10

Glad to hear that you are making sure it adheres to the license, because I actually put in a bit of work to make sure it does.

It is not entirely random, but not so far from it. It comes from this beauty of a Blender scene I created using a mix of the Array and Build modifier and a lot of duplication.

It has subsequently been optimized and packed into multiple OpenCTM files. After that, I manually extracted the LZMA-encoded part (the vertices) of a bad file and added the unpacked size to the binary, which is missing in OpenCTM (see #17).
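
A minimal sketch of that last step, under stated assumptions: the standalone `.lzma` header is 13 bytes (1 properties byte, a 4-byte little-endian dictionary size, then an 8-byte little-endian unpacked size), and the extracted stream is assumed to start with the first 5 of those bytes followed directly by the compressed payload. The function name `insert_unpacked_size` is hypothetical, and the layout of the extracted OpenCTM data should be checked against the actual file before relying on this.

```rust
/// Inserts the 8-byte little-endian unpacked size after the 5 properties
/// bytes of a raw LZMA stream, producing a `.lzma`-style byte sequence.
/// Returns `None` if the input is too short to contain the properties.
fn insert_unpacked_size(raw: &[u8], unpacked_size: u64) -> Option<Vec<u8>> {
    // 1 byte of lc/lp/pb properties + 4 bytes of dictionary size.
    const PROPS_LEN: usize = 5;
    if raw.len() < PROPS_LEN {
        return None;
    }
    let (props, payload) = raw.split_at(PROPS_LEN);
    let mut out = Vec::with_capacity(raw.len() + 8);
    out.extend_from_slice(props);
    out.extend_from_slice(&unpacked_size.to_le_bytes());
    out.extend_from_slice(payload);
    Some(out)
}
```

The unpacked size has to come from elsewhere, for example from the element counts stored in the surrounding container, since the raw stream itself does not carry it.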

The reason the file was created was exactly because I wanted to reproduce the problem in a file that did not contain sensitive or private data. The problem was originally discovered in other files that I do not have the liberty to share freely ;)

So the licensing is no problem - we are happy to share the attached file under the MIT license or any other license of your choosing.

I am not sure if I am able to reproduce the issue in a smaller file. I tried writing some "random" data with a similar structure (a lot of zeros at the start of the file), but was unable to reproduce the issue.

dragly · 2019-12-10T09:24:22Z

> > We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.
>
> Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.

That would be great! Performance is in fact very important to us. I have actually been looking into optimizing lzma_rs after profiling a bit. If I recall correctly, it is about 3x-6x slower than xz when running natively and about as fast as lzma-js when running in the browser.

I think what I found was that there was a fair bit of allocation and deallocation going on when we were parsing a large number of files/chunks. I was therefore thinking about restructuring the code so the decoder could be re-used instead of set up and torn down for each file/chunk. It might be that this is not so important if the performance impact is actually coming from the log-statements. I actually did not know that those were not optimized out in a release build.

> I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.

Cool! I will check that out!

> Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?

Not that I know of. That is partially why I asked, in case you knew of some resources I had not found ;)

> I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)

That would be awesome! I was thinking about doing something similar: To write down what I learned along the way while trying to understand more of LZMA. However, I do not think my understanding is anywhere near a level where I can confidently explain how it all works :)

gendx

Overall looks good. I checked against https://github.com/xz-mirror/xz/tree/master/src/liblzma/lzma and https://github.com/jljusten/LZMA-SDK, which both have the behavior you describe.

More tests can be added later once #20 is implemented, and once someone figures out how to unit test this.

In the meantime, there's no need to block this pull request, I just have a comment on the phrasing.

tests/files/README.md

gendx · 2019-12-16T22:04:34Z

bors r+

bors · 2019-12-16T22:10:36Z

Merge conflict

gendx · 2019-12-16T22:16:01Z

bors r+

16: Do not raise error if code equals range in get_bit r=gendx a=dragly

It appears that code being equal to range is a valid state and that some files can have data encoded like this. An example of such a file has been added to the tests.

### Pull Request Overview

This pull request fixes #15

### Testing Strategy

This pull request was tested by...

- [ ] Added relevant unit tests.
- [x] Added relevant end-to-end tests (such as `.lzma`, `.lzma2`, `.xz` files).

### Supporting Documentation and References

The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.

### TODO or Help Wanted

None

Co-authored-by: Svenn-Arne Dragly <dragly@cognite.com>
Co-authored-by: gendx <gendx@users.noreply.github.com>
Co-authored-by: Svenn-Arne Dragly <s@dragly.com>

bors · 2019-12-16T23:14:55Z

Build succeeded

continuous-integration/travis-ci/push

Do not raise error if code equals range in get_bit

30b4b89

It appears that this is a valid state and that some files can have data encoded like this. An example of such a file has been added to the tests. Fixes gendx#15

gendx reviewed Dec 10, 2019

Update README.md

194565e

gendx requested changes Dec 11, 2019

dragly and others added 5 commits December 13, 2019 21:36

Update README and rename range coder edge case file

9e933e1

cargo fmt

98e7f39

Update README.md

c82174c

Update lzma.rs

0745902

Fix typo in tests/lzma.rs

b8c9f8f

gendx approved these changes Dec 16, 2019

Merge branch 'master' into dragly/fix-corrupted-range-coding

0ee7b55

bors bot merged commit 0ee7b55 into gendx:master Dec 16, 2019
