Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for specifying the unpacked size outside of header #17

Merged
merged 7 commits into from
Dec 16, 2019

Conversation

dragly
Copy link
Contributor

@dragly dragly commented Dec 9, 2019

Some LZMA streams, such as those used in OpenCTM, are encoded without
the unpacked size specified in the header. This is possible to read in
some of the C implementations of LZMA by specifying a header size and
providing the unpacked size as an option to the decoder.

This change adds the same possibility to lzma_rs in a typesafe manner,
where the unpacked size and whether it should be written to the header
is specified explicitly.

Pull Request Overview

This pull request adds Options objects to two new public modules named compress and decompress that can be used to specify whether the unpacked size should be written to and read from the LZMA header.

Testing Strategy

This pull request was tested by...

  • Added relevant unit tests.
  • Added relevant end-to-end tests (such as .lzma, .lzma2, .xz files).

Supporting Documentation and References

This exotic use of the LZMA header is only indicated in the OpenCTM specification itself, where it specifically states that the offset of the stream is only 9 bytes, while it should have been 17 bytes with the unpacked size in place. This is based on 4 bytes from OpenCTM itself, 1 byte for the props, 4 bytes for the dict size and the missing 8 bytes for the unpacked size.

It can also be seen from the OpenCTM source code that the LZMA header written is only 5 bytes:

https://github.com/Danny02/OpenCTM/blob/243a343bd23bbeef8731f06ed91e3996604e1af4/lib/stream.c#L311

TODO or Help Wanted

This pull request still needs a bit of bikeshedding on the names of the modules and enums used in the options :)

Some LZMA streams, such as those used in OpenCTM, are encoded without
the unpacked size specified in the header. This is possible to read in
some of the C implementations of LZMA by specifying a header size and
providing the unpacked size as an option to the decoder.

This change adds the same possibility to lzma_rs in a typesafe manner,
where the unpacked size and whether it should be written to the header
is specified explicitly.
Copy link
Owner

@gendx gendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great :) Just a few comments.

src/decode/options.rs Outdated Show resolved Hide resolved
src/decode/options.rs Outdated Show resolved Hide resolved
src/decode/options.rs Outdated Show resolved Hide resolved
src/encode/options.rs Outdated Show resolved Hide resolved
src/encode/options.rs Outdated Show resolved Hide resolved
src/encode/options.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
tests/lzma.rs Outdated Show resolved Hide resolved
tests/lzma.rs Show resolved Hide resolved
@dragly
Copy link
Contributor Author

dragly commented Dec 11, 2019

Thanks for the comments!

After writing the tests, I realized that the end-of-stream marker is always written. However, this is documented in XZ as only being supposed to be written if it is not provided in the header.

We should probably not write this if the value is provided and written in the header. Is removing the bits encoded in Encoder::finish the way to achieve this?

// Write end-of-stream marker
let pos_state = input_len & 3;
// Match
self.rangecoder
.encode_bit(&mut self.is_match[pos_state], true)?;
// New distance
self.rangecoder.encode_bit(&mut 0x400, false)?;
// Dummy len, as small as possible (len = 0)
for _ in 0..4 {
self.rangecoder.encode_bit(&mut 0x400, false)?;
}
// Distance marker = 0xFFFFFFFF
// pos_slot = 63
for _ in 0..6 {
self.rangecoder.encode_bit(&mut 0x400, true)?;
}
// num_direct_bits = 30
// result = 3 << 30 = C000_0000
// + 3FFF_FFF0 (26 bits)
// + F ( 4 bits)
for _ in 0..30 {
self.rangecoder.encode_bit(&mut 0x400, true)?;
}
// = FFFF_FFFF

Not sure exactly what should be done if it is known, but not provided, like in the case of OpenCTM. I will see if I can figure out what happens in other libs in that case.

@gendx
Copy link
Owner

gendx commented Dec 11, 2019

Thanks for the comments!

After writing the tests, I realized that the end-of-stream marker is always written. However, this is documented in XZ as only being supposed to be written if it is not provided in the header.

We should probably not write this if the value is provided and written in the header. Is removing the bits encoded in Encoder::finish the way to achieve this?

// Write end-of-stream marker
let pos_state = input_len & 3;
// Match
self.rangecoder
.encode_bit(&mut self.is_match[pos_state], true)?;
// New distance
self.rangecoder.encode_bit(&mut 0x400, false)?;
// Dummy len, as small as possible (len = 0)
for _ in 0..4 {
self.rangecoder.encode_bit(&mut 0x400, false)?;
}
// Distance marker = 0xFFFFFFFF
// pos_slot = 63
for _ in 0..6 {
self.rangecoder.encode_bit(&mut 0x400, true)?;
}
// num_direct_bits = 30
// result = 3 << 30 = C000_0000
// + 3FFF_FFF0 (26 bits)
// + F ( 4 bits)
for _ in 0..30 {
self.rangecoder.encode_bit(&mut 0x400, true)?;
}
// = FFFF_FFFF

Not sure exactly what should be done if it is known, but not provided, like in the case of OpenCTM. I will see if I can figure out what happens in other libs in that case.

As the name goes, the dumbencoder is currently quite dumb. But I guess removing that part of the Encoder::finish would work. You can also test the result against unlzma in the command line to see if that works.

Also, make the Options derive Clone and Debug.
These tests do not make sense because they provided a value (None) that
indicated an end marker to be expected in cases where there should be
none.
@dragly
Copy link
Contributor Author

dragly commented Dec 13, 2019

Seems like it works fine with unlzma, so I removed writing the end marker if the unpacked size is written to the header.

I also realized that a couple of the tests did not really make sense after this and removed those.

@gendx
Copy link
Owner

gendx commented Dec 16, 2019

bors r+

bors bot added a commit that referenced this pull request Dec 16, 2019
17: Add support for specifying the unpacked size outside of header r=gendx a=dragly

Some LZMA streams, such as those used in OpenCTM, are encoded without
the unpacked size specified in the header. This is possible to read in
some of the C implementations of LZMA by specifying a header size and
providing the unpacked size as an option to the decoder.

This change adds the same possibility to lzma_rs in a typesafe manner,
where the unpacked size and whether it should be written to the header
is specified explicitly.

### Pull Request Overview

This pull request adds `Options` objects to two new public modules named `compress` and `decompress` that can be used to specify whether the unpacked size should be written to and read from the LZMA header.

### Testing Strategy

This pull request was tested by...

- [x] Added relevant unit tests.
- [ ] Added relevant end-to-end tests (such as `.lzma`, `.lzma2`, `.xz` files).


### Supporting Documentation and References

This exotic use of the LZMA header is only indicated in the [OpenCTM specification itself](http://openctm.sourceforge.net/media/FormatSpecification.pdf), where it specifically states that the offset of the stream [is only 9 bytes](https://github.com/Danny02/OpenCTM/blob/243a343bd23bbeef8731f06ed91e3996604e1af4/doc/FormatSpecification.tex#L91), while it should have been 17 bytes with the unpacked size in place. This is based on 4 bytes from OpenCTM itself, 1 byte for the props, 4 bytes for the dict size and the missing 8 bytes for the unpacked size.

It can also be seen from the OpenCTM source code that the LZMA header written is only 5 bytes:

https://github.com/Danny02/OpenCTM/blob/243a343bd23bbeef8731f06ed91e3996604e1af4/lib/stream.c#L311

### TODO or Help Wanted

This pull request still needs a bit of bikeshedding on the names of the modules and enums used in the options :)


Co-authored-by: Svenn-Arne Dragly <s@dragly.com>
Co-authored-by: Svenn-Arne Dragly <dragly@cognite.com>
Co-authored-by: G. Endignoux <ggendx@gmail.com>
@bors
Copy link
Contributor

bors bot commented Dec 16, 2019

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants