Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not raise error if code equals range in get_bit #16

Merged
merged 8 commits into from Dec 16, 2019

Conversation

dragly
Copy link
Contributor

@dragly dragly commented Dec 9, 2019

It appears that code being equal to range is a valid state and that some files can have
data encoded like this. An example of such a file has been added to the
tests.

Pull Request Overview

This pull request fixes #15

Testing Strategy

This pull request was tested by...

  • Added relevant unit tests.
  • Added relevant end-to-end tests (such as .lzma, .lzma2, .xz files).

Supporting Documentation and References

The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.

TODO or Help Wanted

None

It appears that this is a valid state and that some files can have
data encoded like this. An example of such a file has been added to the
tests.

Fixes gendx#15
@gendx
Copy link
Owner

gendx commented Dec 9, 2019

Thanks for your contribution!

One thing I wanted to do but didn't have the time yet was to have tests comparing the implementation to established libraries such as https://crates.io/crates/lzma-sys or https://crates.io/crates/xz2 (backed by a reference C implementation). That is, the decomp_big_file test should also check that we get the same result as the system library.

I don't know how hard it would be to decompress a raw LZMA stream with these crates, as opposed to an xz file. If you manage to implement something like that, it would be very helpful in testing the correctness of such changes!

Otherwise, I'll get back to the details when I have more time. But if you have a link to another LZMA implementation where the code == range case is handled as you describe that would also speed things up :)

@dragly
Copy link
Contributor Author

dragly commented Dec 9, 2019

Ah, yes. Sorry, I intended to link to the relevant code in XZ Utils, but forgot to do so.

I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/rangecoder/range_decoder.h#L175

It is used from here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/lzma/lzma_decoder.c#L625

As you can see, there is no test here, but a few additional operations after subtracting the range from the code. I am not sure what they correspond to in lzma_rs.

I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well. We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? 😅 There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)

@gendx
Copy link
Owner

gendx commented Dec 9, 2019

I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/rangecoder/range_decoder.h#L175

Thanks, this file will be helpful in double-checking how the range coder works in detail :)

I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well.

On this topic, turns out that I already wrote a dumb xz encoder https://github.com/gendx/lzma-rs/blob/master/src/encode/xz.rs, that currently packages an LZMA2 stream which itself encapsulates raw uncompressed data. But it shouldn't be too hard to turn that into a packaging tool that takes an already encoded LZMA stream and packages it as an xz archive, which would allow easier comparison with other libraries.

We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.

By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? sweat_smile There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)

I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.

Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?

I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)

Comment on lines 9 to 11
This is a file that causes the code and range to be equal at some point during decoding LZMA data.
Previously, this file would raise an `LZMAError("Corrupted range coding")`,
although the file is a valid LZMA file.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you document how you generated this file? From the looks of it, it's not really "random" in the sense of head -c N /dev/urandom. Or is the file more structured than that (you mentioned the OpenCTM format, could you add a link to what OpenCTM is)?

My concerns are that:

  • The file is quite large, which will impact the size of the repository.
  • Given the file size, I also want to make sure that the license for the file's contents is compatible with the repository's license (MIT), and that it doesn't contain any sensitive/personal/private data.

Otherwise I'm wondering how many small random files would be necessary to reproduce the code == range case. Or whether the dumb encoder could be used to generate this case with a much smaller file (e.g. with an all-zeros or all-ones file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to hear that you are making sure it adheres to the license because I actually put down a bit of work to make it sure it does.

It is not entirely random, but not so far from it. It comes from this beauty of a Blender scene I created using a mix of the Array and Build modifier and a lot of duplication.

image

It has subsequently been optimized and packed into multiple OpenCTM files. After that, I manually extracted the LZMA-encoded part (the vertices) of a bad file and added the unpacked size to the binary, which is missing in OpenCTM (see #17).

The reason the file was created was exactly because I wanted to reproduce the problem in a file that did not contain sensitive or private data. The problem was originally discovered in other files that I do not have the liberty to share freely ;)

So the licensing is no problem - we are happy to share the attached file under the MIT license or any other license of your choosing.

I am not sure if I am able to reproduce the issue in a smaller file. I tried writing some "random" data with a similar structure (a lot of zeros at the start of the file), but was unable to reproduce the issue.

@dragly
Copy link
Contributor Author

dragly commented Dec 10, 2019

We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.

That would be great! Performance is in fact very important to us. I have actually been looking into optimizing lzma_rs after profiling a bit. If I recall correctly, it is about 3x-6x slower than xz when running natively and about as fast as lzma-js when running in the browser.

I think what I found was that there was a fair bit of allocation and deallocation going on when we were parsing a large number of files/chunks. I was therefore thinking about restructuring the code so the decoder could be re-used instead of set up and torn down for each file/chunk. It might be that this is not so important if the performance impact is actually coming from the log-statements. I actually did not know that those were not optimized out in a release build.

I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.

Cool! I will check that out!

Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?

Not that I know of. That is partially why I asked, in case you knew of some resourced I had not found ;)

I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)

That would be awesome! I was thinking about doing something similar: To write down what I learned along the way while trying to understand more of LZMA. However, I do not think my understanding is anywhere near a level where I can confidently explain how it all works :)

Copy link
Owner

@gendx gendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. I checked against https://github.com/xz-mirror/xz/tree/master/src/liblzma/lzma and https://github.com/jljusten/LZMA-SDK, which both have the behavior you describe.

More tests can be added later once #20 is implemented, and once someone figures out how to unit test this.

In the meantime, there's no need to block this pull request, I just have a comment on the phrasing.

tests/files/README.md Outdated Show resolved Hide resolved
@gendx
Copy link
Owner

gendx commented Dec 16, 2019

bors r+

@bors
Copy link
Contributor

bors bot commented Dec 16, 2019

Merge conflict

@gendx
Copy link
Owner

gendx commented Dec 16, 2019

bors r+

bors bot added a commit that referenced this pull request Dec 16, 2019
16: Do not raise error if code equals range in get_bit r=gendx a=dragly

It appears that code being equal to range is a valid state and that some files can have 
data encoded like this. An example of such a file has been added to the
tests.

### Pull Request Overview

This pull request fixes #15 

### Testing Strategy

This pull request was tested by...

- [ ] Added relevant unit tests.
- [x] Added relevant end-to-end tests (such as `.lzma`, `.lzma2`, `.xz` files).


### Supporting Documentation and References

The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.

### TODO or Help Wanted

None


Co-authored-by: Svenn-Arne Dragly <dragly@cognite.com>
Co-authored-by: gendx <gendx@users.noreply.github.com>
Co-authored-by: Svenn-Arne Dragly <s@dragly.com>
@bors
Copy link
Contributor

bors bot commented Dec 16, 2019

Build succeeded

@bors bors bot merged commit 0ee7b55 into gendx:master Dec 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Seemingly valid lzma files result in "Corrupted range coding"
2 participants