
Add decoder fuzzing #34

Closed · wants to merge 14 commits

Conversation

@Shnatsel (Contributor) commented Apr 16, 2020

Pull Request Overview

  • Add fuzzing targets for decoding each format
  • Ignore all xz checksums when fuzzing because the fuzzer cannot create files with valid checksums
  • Run all fuzzing targets overnight to compile a corpus of interesting inputs
    • xz was seeded with official xz test suite (in public domain)
  • Add fuzzing target for comparing xz decoding to liblzma (via xz2 wrapper crate)
  • Add README with an overview of fuzzing
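As a hedged aside on the checksum bullet above: instead of ignoring checksums, a custom mutator could in principle recompute the CRC32 of mutated blocks so that inputs pass verification. A minimal std-only sketch of the bitwise CRC-32 (IEEE polynomial, the algorithm behind xz's CRC32 check type); the mutator wiring itself is omitted and not part of this PR:

```rust
// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320) over a byte slice.
// A structure-aware mutator could call this to fix up checksum fields
// after mutating a block, so inputs survive the checksum layer.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is 0xFFFFFFFF if the low bit is set, else 0
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn main() {
    // Well-known CRC-32 check value for the ASCII string "123456789".
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    println!("crc32 ok");
}
```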

Testing Strategy

This entire pull request consists of tests. They demonstrably detect failures: they have already discovered #32.

They haven't found a single bug in decoding, which is completely unexpected. This is the first time I've seen a codebase that hadn't been extensively fuzzed before hold up under fuzzing. But the coverage figures look about right, and failure detection clearly works.

There are some mismatches discovered between liblzma and lzma-rs (see #35).

Supporting Documentation and References

https://rust-fuzz.github.io/book/introduction.html

TODO or Help Wanted

For some reason fuzzing lzma1 decoding is 4x slower than lzma2.

I didn't compare lzma1 decoding to the xz2 crate because I'm not sure which variant the xz2 crate uses. It should be trivial to add later if desired.

Continuous fuzzing via e.g. https://fuzzit.dev/ would be very welcome, but can only be set up by maintainers.

It would be nice to advertise in the README that the decoding has been tested to produce identical results to liblzma once the mismatches are fixed (#35).

@Shnatsel (Contributor Author)

I've found a bug in the comparison between liblzma and lzma-rs. I've pushed a fix; it now detects some behavior mismatches.
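The comparison mentioned above boils down to a differential check between two decoders. A minimal std-only sketch of that shape (toy closures stand in for lzma-rs and liblzma here; this is not the PR's actual harness code):

```rust
// Generic differential check: the two decoders must agree on whether the
// input is valid and, on success, on the decoded bytes. `a` and `b`
// stand in for the lzma-rs and liblzma decode paths.
fn differs<A, B>(input: &[u8], a: A, b: B) -> bool
where
    A: Fn(&[u8]) -> Result<Vec<u8>, String>,
    B: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    match (a(input), b(input)) {
        (Ok(x), Ok(y)) => x != y,  // both accept: outputs must match
        (Err(_), Err(_)) => false, // both reject: fine
        _ => true,                 // one accepts, one rejects: mismatch
    }
}

fn main() {
    // Toy stand-ins: an "identity decoder" and an "always reject" decoder.
    let ok = |d: &[u8]| Ok::<_, String>(d.to_vec());
    let rej = |_: &[u8]| Err::<Vec<u8>, _>("invalid".to_string());
    assert!(!differs(b"abc", ok, ok)); // both agree
    assert!(differs(b"abc", ok, rej)); // accept vs reject: mismatch
    println!("differential check ok");
}
```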

@gendx (Owner) left a comment

Can you remove the corpus from this pull request (including the git history), or make a separate pull request without the corpus, to make it easier to review and to keep the git repository clean?


Pull Request Overview

  • Add fuzzing targets for decoding each format

  • Ignore all xz checksums when fuzzing because the fuzzer cannot create files with valid checksums

  • Run all fuzzing targets overnight to compile a corpus of interesting inputs

    • xz was seeded with official xz test suite (in public domain)

Note that xz is a container format around LZMA2, itself containing raw LZMA streams. My guess is that random bit flips and such mutations applied on xz files would often test the outer layer (container layer), without exercising a lot of the LZMA logic (because most mutations would just produce invalid data at the xz layer).
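The point above can be made concrete with the xz stream header: every xz file starts with the fixed 6-byte magic FD 37 7A 58 5A 00, so any mutation that touches it is rejected before the LZMA2/LZMA logic ever runs. A std-only sketch (a simplified stand-in for the real container validation, which also checks stream flags and a CRC):

```rust
// The fixed 6-byte magic at the start of every .xz stream.
const XZ_MAGIC: [u8; 6] = [0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00];

// Simplified outer-layer check: a real decoder also validates stream
// flags and header CRC32, rejecting even more mutated inputs early.
fn passes_container_check(data: &[u8]) -> bool {
    data.len() >= 6 && data[..6] == XZ_MAGIC
}

fn main() {
    let mut sample = XZ_MAGIC.to_vec();
    sample.extend_from_slice(&[0u8; 16]); // dummy payload
    assert!(passes_container_check(&sample));

    // Flip a single bit in the magic: the file dies at the container
    // layer, and the LZMA2/LZMA code paths are never exercised.
    let mut mutated = sample.clone();
    mutated[0] ^= 0x01;
    assert!(!passes_container_check(&mutated));
    println!("magic check ok");
}
```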

As a future fuzzing project, it could be interesting to extract an LZMA2 corpus and an LZMA corpus from the xz corpus, to fuzz more specifically the lower layers of the logic.

Testing Strategy

This entire pull request consists of tests. They demonstrably detect failures: they have already discovered #32.

Good catch!

They haven't found a single bug in decoding, which is completely unexpected. This is the first time I've seen a codebase that hadn't been extensively fuzzed before hold up under fuzzing. But the coverage figures look about right, and failure detection clearly works.

What do you mean by "bug"? I see at least a crash (#32) and a mismatch (#35).

Other than that, my brief explanation would be to run:

  • git grep unsafe
  • git grep unwrap

The goal of this project is to prioritize clarity, safety and correctness over performance.

Another possible explanation: a "dumb" fuzzer without domain knowledge doesn't exercise deep interesting code paths. An example that comes to mind (caught during a code review) would be #22 (comment). It would be interesting to know if the fuzzer would catch such a bug if it was introduced in the code base.

I'm wondering if the coverage numbers you mention can pinpoint which lines of the source code have been covered by fuzzing, and how often?

Another interesting fuzzing scenario would be to generate "more valid" inputs (e.g. with #24) to test deeper layers of the logic.

There are some mismatches discovered between liblzma and lzma-rs (see #35).

Good catch!

Supporting Documentation and References

https://rust-fuzz.github.io/book/introduction.html

TODO or Help Wanted

For some reason fuzzing lzma1 decoding is 4x slower than lzma2.

My gut feeling is that random inputs generated by the fuzzer are more likely to be invalid for lzma2 than for lzma1, and that the invalidity is detected earlier, before any allocation happens, etc.

Some possible explanations:

  • An LZMA stream (what you call "lzma1") has a short header (5 bytes), with the constraint that the first byte must be < 225. After that, it's a range coder (which is rather slow). Once an input satisfies these constraints (at least 5 bytes, first byte < 225), the decoder likely goes quite far in parsing the input (even though the end result is invalid - e.g. an end-of-stream marker is missing).
  • LZMA2 is a wrapper around LZMA chunks. The first byte of a chunk is more constrained (0, 1, 2 or >= 128). Then 4 bytes of size. Then in 3/4 cases a byte < 255. This makes it more likely that a random input is rejected by lzma2 than by lzma1.
  • An LZMA2 stream can contain uncompressed streams (which are copied as-is without processing - which is quite fast).
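The first two bullets above can be checked arithmetically. The first byte of an LZMA stream encodes (lc, lp, pb) as props = (pb·5 + lp)·9 + lc with lc < 9, lp < 5, pb < 5, so the maximum valid value is (4·5 + 4)·9 + 8 = 224, hence "first byte < 225". A std-only sketch comparing the acceptance rate of a random first byte for the two formats:

```rust
// Decode the LZMA properties byte: props = (pb * 5 + lp) * 9 + lc,
// valid only for props <= 224 (i.e. "first byte < 225").
fn decode_props(props: u8) -> Option<(u8, u8, u8)> {
    if props >= 225 {
        return None;
    }
    let lc = props % 9;
    let rest = props / 9;
    let lp = rest % 5;
    let pb = rest / 5;
    Some((lc, lp, pb))
}

fn main() {
    // 225 of 256 random first bytes pass the LZMA check (~88%), so a
    // random input tends to survive into the (slow) range coder.
    let lzma1_ok = (0u8..=224).filter(|&b| decode_props(b).is_some()).count();
    assert_eq!(lzma1_ok, 225);

    // An LZMA2 chunk control byte is only valid if it is 0, 1, 2 or
    // >= 128: 131 of 256 (~51%), so random inputs are rejected sooner.
    let lzma2_ok = (0u16..=255).filter(|&b| b <= 2 || b >= 128).count();
    assert_eq!(lzma2_ok, 131);
    println!("lzma1 accepts {lzma1_ok}/256, lzma2 chunk byte accepts {lzma2_ok}/256");
}
```

This is consistent with the observation that fuzzing lzma1 is slower per execution: more random inputs make it past the header into the range coder.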

I didn't compare lzma1 decoding to the xz2 crate because I'm not sure which variant the xz2 crate uses. It should be trivial to add later if desired.

It would be interesting to add a scenario that generates LZMA streams that are then wrapped into a valid XZ/LZMA2 container, for differential fuzzing of the LZMA layer, without bothering about the API available in xz2.

Continuous fuzzing via e.g. https://fuzzit.dev/ would be very welcome, but can only be set up by maintainers.

It would be nice to advertise in the README that the decoding has been tested to produce identical results to liblzma once the mismatches are fixed (#35).

I'd be cautious not to over-advertise what's been tested until more complex fuzzing scenarios have been tried as well.

@@ -1,4 +1,3 @@

target
corpus
@gendx (Owner):

This is intentional:

  • a pull request with 4574 files is intractable to review,
  • this adds many unnecessary files to the main git repository.

Although I agree that the corpus can have some value, I would keep that outside of the master branch (for example as a separate open pull request).

@Shnatsel (Contributor Author):

I don't believe having the corpus in a PR is a good idea. I'd aim for continuous fuzzing via either fuzzit.dev or Google's oss-fuzz, so I'd suggest looking at what those platforms require and working with that. I've seen repos that use fuzzit.dev keep the corpus out of the repo (e.g. https://github.com/image-rs/image-png/), but I don't know exactly how that's accomplished.

@Shnatsel Shnatsel marked this pull request as draft April 16, 2020 21:56
@Shnatsel (Contributor Author)

Can you remove the corpus from this pull request (including the git history), or make a separate pull request without the corpus, to make it easier to review and to keep the git repository clean?

Fair point, will do!

As a future fuzzing project, it could be interesting to extract an LZMA2 corpus and an LZMA corpus from the xz corpus, to fuzz more specifically the lower layers of the logic.

Indeed. I'm not familiar with the XZ format, so any help with that would be appreciated.

What do you mean by "bug"? I see at least a crash (#32) and a mismatch (#35).

The panic was already triggered by the XZ test files, so it wasn't really discovered by the fuzzer. I expected to see more panics, or an OOM, or an infinite loop, or something along those lines. I found nothing that was exploitable in any appreciable way, not even a DoS!

An example that comes to mind (caught during a code review) would be #22 (comment). It would be interesting to know if the fuzzer would catch such a bug if it was introduced in the code base.

If it would result in a panic, then most likely yes! It would also detect a returned error or incorrect output if comparison against liblzma is used. Instrumentation-guided fuzzers are pretty sophisticated when it comes to binary formats. Combine that with 100,000,000 executions per day on a $100 desktop CPU, and you might even stumble into such an edge case by sheer brute force; I've had a fuzzer expose an interesting edge case and stumble upon a valid CRC16 for it overnight.

I'm wondering if the coverage numbers you mention can pin-point which lines of the source code have been covered by fuzzing - and how often?

This is based on LLVM sanitizer-coverage, so you can use any tool that works with it to visualize coverage data. Or just run the entire corpus through a very simple decompression program using the coverage tool of your choice, like tarpaulin.

@Shnatsel Shnatsel mentioned this pull request Apr 16, 2020
@Shnatsel (Contributor Author)

Superseded by #36

@Shnatsel Shnatsel closed this Apr 16, 2020