New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzz: a more efficient descriptor parsing target #27888
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
9cdb8c7
to
001b169
Compare
Looks like this touches non-fuzz code? No opinion if people want this, but a much simpler implementation would be to just create a copy of the buffer and inject the pre-generated static string into it. (This is basically what a fuzz engine does automatically, with the difference that you can now provide the string dictionary yourself, directly in the fuzz target). the fuzz target would look like: auto str{fdp.ConsumeString()};
if (fdp.ConsumeBool()) {
str = MockDescriptor(str);
}
const auto desc = Parse(str); With This has the benefits of not mocking out the parsing logic, which for |
In what you suggest To be clear this is not only about mocking |
Ok, I see. I guess another alternative would be to randomly inject random pre-generated keys at random positions in the string, absent of any logic. Then let the fuzz engine figure out the right positions via coverage feedback. No strong opinion, just leaving random ideas that can be implemented with less code. |
Concept ACK Mocking seems fine to me, but I wonder if we could achieve something similar by placing (in)valid encoded keys in a fuzz dictionary (e.g. bitcoin-core/qa-assets#122), which is also almost the same as Marco's suggestion. This would be less efficient compared to what is in this PR but the inputs would still be available for debugging over RPC. We can of course also just do both since adding to the dictionaries is very easy. |
Thanks both for throwing in ideas. Note i don't have strong opinions here either, it's just some review code that i figured could be helpful having in too. However i still don't think the approaches suggested here would be better:
|
Another alternative may be to just perform search-replace in the fuzz-read string before handing it to the parser? Eg anything of the form "%XX" where XX is two hex characters, is replaced by a lookup in a table. |
Hmm good point. Using a distinguishable character fixes the issue of having to "basically reimplement descriptor parsing logic".
------- Original Message -------
…On Wednesday, June 14th, 2023 at 5:19 PM, Pieter Wuille ***@***.***> wrote:
Another alternative may be to just perform search-replace in the fuzz-read string before handing it to the parser? Eg anything of the form "%XX" where XX is two hex characters, is replaced by a lookup in a table.
—
Reply to this email directly, [view it on GitHub](#27888 (comment)), or [unsubscribe](https://github.com/notifications/unsubscribe-auth/AFLK3F3PYOENOE2ZPE75HITXLHI7ZANCNFSM6AAAAAAZGLFYQU).
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
I am happy to run a bench, if you happen to have a bug laying around :) |
I don't, but i think Pieter settled the debate anyways. :)
------- Original Message -------
…On Wednesday, June 14th, 2023 at 5:31 PM, MacrabFalke ***@***.***> wrote:
> I may be underestimating the fuzzer's capabilities but it sounds much less efficient.
I am happy to run a bench, if you happen to have a crash+bug laying around :)
—
Reply to this email directly, [view it on GitHub](#27888 (comment)), or [unsubscribe](https://github.com/notifications/unsubscribe-auth/AFLK3F75QKRNVW44OITSYKLXLHKO5ANCNFSM6AAAAAAZGLFYQU).
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
7f384ae
to
3def962
Compare
3def962
to
8edac3d
Compare
Updated following @sipa's suggestion of using a marker to be able to search and replace in the descriptor string directly. The diff is now twice smaller, and this does not touch the descriptor parsing logic anymore. After running the new
|
8edac3d
to
fad60fc
Compare
fad60fc
to
b23c2e2
Compare
Suggest putting "Fuzz" in title, and labelling. |
We'll be reusing it in the new target.
b23c2e2
to
7349861
Compare
For testing I've injected a bug in currently uncovered code: diff --git a/src/script/descriptor.cpp b/src/script/descriptor.cpp
index 09ded5fc61..b69db182ab 100644
--- a/src/script/descriptor.cpp
+++ b/src/script/descriptor.cpp
@@ -509,7 +509,7 @@ public:
out = "[" + origin_str + "]" + EncodeExtPubKey(xpub) + FormatHDKeypath(end_path);
if (IsRange()) {
out += "/*";
- assert(m_derive == DeriveType::UNHARDENED);
+ assert(m_derive != DeriveType::UNHARDENED); // Injected BUG!! (bad)
}
return true;
} ... But it looks like the fuzz target just crashed immediately anyway (see CI) |
7349861
to
84dee4f
Compare
Thanks for testing. Fixed the issue. I also introduced the bug you shared and it does make the target crash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, lgtm ACK 84dee4f 🦄
Show signature
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: nice, lgtm ACK 84dee4fe690e08a5adaad1c78530666da07075d8 🦄
ag2blJR9VoWl2NoAXwUZSDU54s5hPaFz7IJt+iLlaSMLYtYwbr4/XMw4CW9xm1A+BXAn9ZXtWQxrwFSRr/COBg==
cfc6a6e
to
f00c62e
Compare
This new target focuses on fuzzing the actual descriptor parsing logic by not requiring the fuzzer to produce valid keys (nor a valid checksum for that matter). This should make it much more efficient to find bugs we could introduce moving forward. Using a character as a marker (here '%') to be able to search and replace in the string without having to mock the actual descriptor parsing logic was an insight from Pieter Wuille.
Once a descriptor is successfully parsed, execute more of its methods. There is probably still room for improvements by checking for some invariants, but this is a low hanging fruit that significantly increases the code coverage of these targets.
f00c62e
to
131314b
Compare
re-ACK 131314b 🐓 Show signatureSignature:
|
ACK 131314b |
std::optional<uint8_t> IdxFromHex(std::string_view hex_characters) const { | ||
if (hex_characters.size() != 2) return {}; | ||
auto idx = ParseHex(hex_characters); | ||
if (idx.size() != 1) return {}; | ||
return idx[0]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could just use the raw (single) byte here, but that would interfere with libFuzzer -only_ascii=1
, which makes me wonder if this is the first ascii fuzz target and whether we should set the option somewhere somehow during input generation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny how -only_ascii=1
performs worse than -only_ascii=0
(#27888 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or at least, not significantly better)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the target returning early on non-ASCII has basically the same effect?
131314b fuzz: increase coverage of the descriptor targets (Antoine Poinsot) 90a2474 fuzz: add a new, more efficient, descriptor parsing target (Antoine Poinsot) d60229e fuzz: make the parsed descriptor testing into a function (Antoine Poinsot) Pull request description: The current descriptor parsing fuzz target requires valid public or private keys to be provided. This is unnecessary as we are only interested in fuzzing the descriptor parsing logic here (other targets are focused on fuzzing keys serializations). And it's pretty inefficient, especially for formats that need a checksum (`xpub`, `xprv`, WIF). This introduces a new target that mocks the keys as an index in a list of precomputed keys. Keys are represented as 2 hex characters in the descriptor. The key type (private, public, extended, ..) is deterministically based on this one-byte value. Keys are deterministically generated at target initialization. This is much more efficient and also largely reduces the size of the seeds. TL;DR: for instance instead of requiring the fuzzer to generate a `pk(xpub6DdBu7pBoyf7RjnUVhg8y6LFCfca2QAGJ39FcsgXM52Pg7eejUHLBJn4gNMey5dacyt4AjvKzdTQiuLfRdK8rSzyqZPJmNAcYZ9kVVEz4kj)` to parse a valid descriptor, it just needs to generate a `pk(03)`. Note we only mock the keys themselves, not the entire descriptor key expression. As we want to fuzz the real code that parses the rest of the key expression (origin, derivation paths, ..). This is a target i used for reviewing bitcoin#17190 and bitcoin#27255, and figured it was worth PR'ing on its own since the added complexity for mocking the keys is minimal and it could help prevent introducing bugs to the descriptor parsing logic much more efficiently. ACKs for top commit: MarcoFalke: re-ACK 131314b 🐓 achow101: ACK 131314b Tree-SHA512: 485a8d6a0f31a3a132df94dc57f97bdd81583d63507510debaac6a41dbbb42fa83c704ff3f2bd0b78c8673c583157c9a3efd79410e5e79511859e1470e629118
fa3a410 fuzz: Set -rss_limit_mb=8000 for generate as well (MarcoFalke) fa4e396 fuzz: Generate with random libFuzzer settings (MarcoFalke) Pull request description: Sometimes a libFuzzer setting like `-use_value_profile=1` helps [0], sometimes it hurts [1]. [0] #20789 (comment) [1] #27888 (comment) By picking a random value, it is ensured that at least some of the runs will have the beneficial configuration set. Also, set `-max_total_time` to prevent slow fuzz targets from getting a larger time share, or possibly peg to a single core for a long time and block the python script from exiting for a long time. This can be improved in the future. For example, the python script can exit after some time (#20752 (comment)). Alternatively, it can measure if coverage progress was made and run for less time if no progress has been made recently anyway, so that more time can be spent on targets that are new or still make progress. ACKs for top commit: murchandamus: utACK fa3a410 dergoegge: utACK fa3a410 brunoerg: light ACK fa3a410 Tree-SHA512: bfd04a76ca09aec612397bae5f3f263a608faa7087697169bd4c506c8195c4d2dd84ddc7fcd3ebbc75771eab618fad840af819114968ca3668fc730092376768
The current descriptor parsing fuzz target requires valid public or private keys to be provided. This is unnecessary as we are only interested in fuzzing the descriptor parsing logic here (other targets are focused on fuzzing keys serializations). And it's pretty inefficient, especially for formats that need a checksum (
xpub
,xprv
, WIF).This introduces a new target that mocks the keys as an index in a list of precomputed keys. Keys are represented as 2 hex characters in the descriptor. The key type (private, public, extended, ..) is deterministically based on this one-byte value. Keys are deterministically generated at target initialization. This is much more efficient and also largely reduces the size of the seeds.
TL;DR: for instance instead of requiring the fuzzer to generate a
pk(xpub6DdBu7pBoyf7RjnUVhg8y6LFCfca2QAGJ39FcsgXM52Pg7eejUHLBJn4gNMey5dacyt4AjvKzdTQiuLfRdK8rSzyqZPJmNAcYZ9kVVEz4kj)
to parse a valid descriptor, it just needs to generate apk(03)
.Note we only mock the keys themselves, not the entire descriptor key expression. As we want to fuzz the real code that parses the rest of the key expression (origin, derivation paths, ..).
This is a target i used for reviewing #17190 and #27255, and figured it was worth PR'ing on its own since the added complexity for mocking the keys is minimal and it could help prevent introducing bugs to the descriptor parsing logic much more efficiently.