-
Notifications
You must be signed in to change notification settings - Fork 35.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ASM ambiguity #28824
base: master
Are you sure you want to change the base?
Fix ASM ambiguity #28824
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process. |
Closes: bitcoin#27795 Closes: bitcoin#7996 Previously ScriptToAsmStr returned hex-encoded integers, except if data length was <= 4 bytes, in which case it displayed using decimal encoding. This was originally added by Satoshi in 4405b78/script.h#L298-L305 Remove the decimal carve-out for small pushes. Github-Pull: bitcoin#28824 Rebased-From: d6c0c4b
test/functional/rpc_decodescript.py
Outdated
@@ -133,7 +133,7 @@ def decodescript_script_pub_key(self): | |||
cltv_script = '63' + push_public_key + 'ad670320a107b17568' + push_public_key + 'ac' | |||
rpc_result = self.nodes[0].decodescript(cltv_script) | |||
assert_equal('nonstandard', rpc_result['type']) | |||
assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 500000 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm']) | |||
assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 20a107 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems much harder to understand to me -- you need to convert it to 0x07a120
then convert that to decimal to get back to the nice round number block height? In particular mistaking it for 0x20a107
(2,138,375) would be particularly misleading.
test/functional/wallet_send.py
Outdated
@@ -170,7 +170,7 @@ def test_send(self, from_wallet, to_wallet=None, amount=None, data=None, | |||
else: | |||
assert_greater_than(from_balance_before - from_balance, amount) | |||
else: | |||
assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 35"), None) | |||
assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 23"), None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems confusing, particularly if it appears in something like DUP SIZE 20 EQUALVERIFY HASH160 <x> EQUAL
-- whoops, that's actually saying the input should be 32 bytes, not 20 bytes.
In the dart library coinlib (https://pub.dev/packages/coinlib) I followed the same approach of encoding everything in HEX with integers encoded as LE. However -1 is encoded as -1. See the code here: https://github.com/peercoin/coinlib/blob/master/coinlib/lib/src/scripts/operations.dart I did this to avoid changing the ASM too much. However, I'm happy to move towards using |
@ajtowns @MatthewLM thank for your comments. I agree that after these changes some ASM fields will be less human-readable-friendly in LE hex. Prefixing with $ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"1112131415 0d1112131415"
$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
"1234567890 0d500000 OP_CHECKLOCKTIMEVERIFY"
Thinking about this more, it feels like the question to be answered is perhaps "is the $ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
"0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY" This would create a moderately inconvenient interface in a few of the tests (and possibly some downstream users relying on this output) where hex prefixes, if present, have to be stripped from rpc output, but perhaps that's not too terrible. If we are designing the asm output to be a canonical, machine-readable format then my opinion remains that using hex-only makes the most sense. FWIW btcdeb interprets all values as decimal but prints a warning that they are ambiguous: $ btcc 1234567890 500000 OP_CHECKLOCKTIMEVERIFY
warning: ambiguous input 1234567890 is interpreted as a numeric value; use 0x1234567890 to force into hexadecimal interpretation
warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
04d20296490320a107b1
$ btcc 0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY
warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
0512345678900320a107b1
And consequently generates a different script without manually specifying hex. I don't know what other tools do by default. |
I agree that |
I think it's human readable? For machines we could just give the raw hex script and not bother with an asm format. Anyway, how about going for something more left field and having
Could perhaps also represent non-minimal pushes that way, eg the different encodings of -1 or
|
This has always been my interpretation too.
Square brackets is something I could get behind; to me it is clear from your example that the brackets are indicating something, and just by looking at them I can see in most cases that they're hex values. To use the previous (less common?) example formatted with the brackets: $ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"[1112131415] 1112131415" IMO it's again clear to me now that there's something different about these two values, and if I didn't know what it was then I should find out. One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable? $ src/bitcoin-cli -regtest decoderawtransaction 0100000001696a20784a2c70143f634e95227dbdfdf0ecd51647052e70854512235f5986ca010000008a47304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb014104d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536ffffffff0100e1f505000000001976a914eb6c6e0cdb2d256a32d97b8df1fc75d1920d9bca88ac00000000 | jq .vin[].scriptSig.asm
"[304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb[ALL]] [04d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536]"
specifically: [... 2dfb[ALL]] It might be preferable to have this format instead? [... 2dfb][ALL] |
|
It's pretty misleading though, since Note this applies to block hashes, as well, so if you provide block 816651's header and execute HASH256, you end up with |
I guess it being inside the brackets seems slightly clearer to me? Could have different brackets for the two things, eg |
Those smaller numbers will be shown in decimal still. For larger pushdata, hex is already shown in the little-endian encoded byte order. If |
Do we actually want to resolve all forms of ambiguity? E.g. what about the distinction between an OP_1 vs. a direct push of 0x01, vs. OP_PUSHDATA1 of 0x01? I think it'd be nice if there was a one-to-one correspondence between the disassembly and the actual script bytes. My contribution to this bikeshedding would be:
|
👍
Also
Provided you've already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (
That would be nice. Probably requires remembering info from the scriptPubKey and using it when decoding the scriptSig or witness. |
I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I'm not sure. Having embedded ASM for redeem scripts and Tapscripts would be very useful. |
What ambiguity? |
For 32-bit signed integers, the 32nd leftmost bit is negative, but that's not true for CLTV and CSV which uses 5 bytes. So a push data could represent a negative 32-bit signed integer, or a positive 40-bit signed integer if I understand the |
@MatthewLM I don't think that's correct. What's special in CSV and CLTV is that they permit numbers whose encoding is up to 5 bytes (rather than the math opcodes which only support encodings up to 4 bytes on input). But the encoding algorithm is the same: the top bit is the sign. |
@sipa I'll have to look at the code. I imagine it is two's complement? Then the negative bit will be different for each type of integer meaning that a 4 byte pushdata that would encode a negative 32-bit signed integer could also be interpreted as a positive 40-bit signed integer. |
@ajtowns Ok, iterating on that, new proposal:
There is no need for @MatthewLM I think you're missing something, but I don't see what. There is no ambiguity. Every byte encoding represents exactly one number, with a well-defined sign. |
@sipa Sorry, you are right. I looked at the code again. Bitcoin doesn't use two's complement for this (I have to keep in mind that Bitcoin has a lot of quirks). It uses a sign bit that negates the other bits, and this sign bit depends on the size of the push data (whatever the last byte is), so there is no ambiguity. Therefore I think minimally pushed numbers ought to be displayed as decimal for readability reasons as you now suggest.
Or any other decimal that has an even number of digits. Maybe use something like |
@MatthewLM Right, any single numeric push of even number of digits inside |
One more thought: if the format we end up with is unambiguous, there could also be a function/RPC/tool to convert asm back to script (and we can have fuzz tests that the roundtripping encoding+decoding is exact!). If so, we can probably permit a few things in the decoding that the encoder might not (yet) support:
|
I like where this is going now. Thanks for the conversation. I've written a branch implementing this previous version as I wanted to see what the changes would look like (although I'm not convinced I'm handling
I think this will make the representation even nicer to interpret and will gladly make this change.
I'll take a stab at implementing these nested scripts as well, as I think they'd be a nice improvement too... |
Per
Any push of a single positive number between 10-99; 1,000-9,999; 100,000-999,999; 10,000,000-99,999,999; 1,000,000,000-9,999,999,999; and 100,000,000,000-549,755,813,887 would be ambiguous, no? |
@ajtowns Right on all 3 counts. Instead of limiting the integer range, maybe just "all minimal pushes of up to 5 bytes, which push a minimally-encoded integer". |
One more question: what to do with unknown opcodes, and undecodable bytes? The current Suggestion: use |
Currently As for undecodable bytes, if we did want a keyword, perhaps something more explicit like
|
That's trivially ambiguous, as all unknown opcodes are mapped to the same thing.
Maybe, but I wouldn't use
No, any pushes become either decimal numbers, or EDIT: actually, I think all undecodable things (ignoring unknown opcodes) are pushes (but ones which straddle the end of the script boundary). |
016803c
to
89547bb
Compare
Marked as draft for now and pushed what I have worked on so far. I haven't looked at implementing undecodable bytes or nested scripts yet. |
src/core_write.cpp
Outdated
if (CheckMinimalPush(vch, opcode)) { | ||
if (vch.size() <= 5 ) { | ||
// Return decimal for minimially-encoded values <= 5 bytes (OP_CLTV / OP_CSV accept 5-byte numbers) | ||
CScriptNum n(vch, false, 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want fRequireMinimal=true
true here, otherwise it'll turn non-minimally-encoded values into decimal too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it should be true
as a belt-and-suspenders, but we are inside if (CheckMinimalPush(vch, opcode))
already here, so shouldn't happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Byte code 024000
is a minimal push of the two-byte vector 4000
(so passes CheckMinimalPush
) but it's not a minimal CScriptNum
since it just evaluates to 64, and should have been 0140
(so fails fRequireMinimal
due to the trailing 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there are two distinct concepts of minimality:
- Minimal pushes apply to the interpretation of stack elements as byte vectors, and are about whether you used the right opcode (OP_n, direct push, OP_PUSHDATA1, OP_PUSHDATA2, OP_PUSHDATA4): you have to use the first applicable one from that list for a push to be a minimal push. It's relevant even for non-numeric data in script (e.g. you shouldn't use OP_PUSHDATA1 to push a public key, as a direct push suffices).
- Minimally-encoded integers apply to the interpretation of stack elements as numbers, and are about whether no surplus bytes were introduced when encoding the number as a byte vector but not about how that byte vector gets pushed. The byte vectors
42
,4200
,420000
all encode the number 66, and83
,0380
,030080
, ... all encode the number -3, but you have to use the first one from each list for it to be minimally-encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanations.
I've refactored out handling pushed data, as I wasn't enjoying all the nested if statements, and notice that we are currently handling "non-minimal pushes" the same as minimal pushes with non-minimally-encoded values, which is to use OPCODE<hex>
. This seems ok, and it will not be a lossy decode, but do we want to differentiate between the two pathways to this decode format?
See first and last case in snippet to illustrate what I mean above:
/*
* Format pushed bytes into appropriate ASM format
*/
std::string FormatPushDataAsm(const std::vector<unsigned char>& vch, opcodetype opcode)
{
// Use OPCODE<hex> for non-minimal pushes
// TODO: There are no tests for this currently
if (!CheckMinimalPush(vch, opcode)) {
return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
}
// Use <hex> for minimal pushes > 5 bytes
if (!(vch.size() <= 5) ) {
return BracketStr(HexStr(vch));
}
// Use decimal for minimally-encoded, minimal pushes <= 5 bytes
// Note: OP_CLTV / OP_CSV accept 5-byte numbers
try {
CScriptNum n{vch, true, 5};
return strprintf("%lld", n.GetInt64());
}
// Use OPCODE<hex> for non-minimally-encoded minimal pushes
// TODO: There are no tests for this currently
catch (scriptnum_error& e) {
return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but what I mean is that in my current impl there are two ways we can end up with OPCODE<...>
:
- Wasn't a minimal push
- Was a minimal push <5 bytes, but wasn't a minimally-encoded decimal value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, right. I think that's redundant.
I'd go for something like:
- If minimal push, not over 5 bytes, and minimally encoded, use decimal.
- Otherwise, if minimal push, or direct push, use
<...>
. - Otherwise, use
OPCODE<...>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unambiguous, with the following decoding rules:
- Decimal numbers are turned into
OP_n
if applicable, otherwise into a direct push of the minimally-encoded form of that number. <...>
is turned into a direct push if up to 75 bytes, intoOP_PUSHDATA1
if below 256 bytes, intoOP_PUSHDATA2
if below 65536 bytes, and intoOP_PUSHDATA4
otherwise.OPCODE<...>
is turned into a push using the relevant opcode.
In particular (I think @ajtowns pointed this out before as well), using <...>
for non-minimal but direct pushes is fine, as it cannot represent anything else (not an OP_n
, because those use decimal, and not a PUSHDATA
opcode because if those were used for sizes <= 75 bytes, they'd have used OPCODE<...>
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. This just means changing the final catch
to return BracketStr(HexStr(vch));
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also,
if (!(vch.size() <= 5) ) {
Can I introduce you to >
? 😄
src/core_write.cpp
Outdated
} else { | ||
str += HexStr(vch); | ||
// Otherwise display the push as LE hex enclosed in angle brackets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect this to be a controversial opinion, but I disagree with calling this "little endian". That's a term that applies to encoding numbers to bytes/bits. Since what's being encoded isn't a number, or to be interpreted as a number, endianness is inapplicable. It's just encoding bytes in hexadecimal format, in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not little endian, I agree.
Brainstormy idea for undecodable scripts. These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using EDIT: for |
Currently failure to fully parse in
I don't think it makes much difference which flavour is chosen here; both would require the |
In case it's a direct push there is no opcode at all. I didn't mean this as a flavor to choose from; we need both. |
If you're adding extra encoding that need to be parsed ( |
Yeah, my only (rather minor) reservation about That said, I agree that the +N thing isn't super readabable, and it's incomplete anyway (can't deal with OP_PUSHDATA[124] whose length descriptor itself is missing). So maybe just |
Yeah, I guess being able to scan and say
Could do (Arguably, everything after the first OP_SUCCESS in a tapscript could also be considered trailing hex that shouldn't be decoded. Probably more useful to decode it anyway, though) |
Are you still working on this? |
We don't decode these on taproot signatures already, modify non-taproot signatures to match.
89547bb
to
0a7c9d9
Compare
Strip unneccesary leading "OP_" prefix from opcodes when they are rendered as ASM.
0a7c9d9
to
71e6cf5
Compare
A simple helper function to angle bracket a hex string.
* Return minimal pushes < 5 bytes as decimal * Return non-minimal pushes < 5 bytes wrapped as OP_CODE<...> * Return pushes > 5 bytes as hex wrapped as <...>
71e6cf5
to
4890626
Compare
Attempt to revive the discussion: #27795 (comment) |
Closes: #27795
Closes: #7996
Previously
ScriptToAsmStr
returned hex-encoded values, except if data length was <= 4 bytes, in which case it displayed using decimal encoding.Fix the ASM representation by specifying an unambiguous encoding:
OP_
prefix from all opcodesPUSHDATA1<0100>
<...>
42
<4200>
UNDECODABLE(...)
This should permit unambiguous decoding in the future.