Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ASM ambiguity #28824

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from
Draft

Fix ASM ambiguity #28824

wants to merge 7 commits into from

Conversation

willcl-ark
Copy link
Member

@willcl-ark willcl-ark commented Nov 8, 2023

Closes: #27795
Closes: #7996

Previously ScriptToAsmStr returned hex-encoded values, except if data length was <= 4 bytes, in which case it displayed using decimal encoding.

Fix the ASM representation by specifying an unambiguous encoding:

  • Drop OP_ prefix from all opcodes
  • For non-minimal pushes prefix with the opcode and enclose pushed hex value in angle brackets PUSHDATA1<0100>
  • For minimal pushes:
    • If > 5 bytes display pushed hex value enclosed in angle brackets <...>
    • If <= 5 bytes:
      • If minimally-encoded display pushed value in decimal without angle brackets 42
      • If not minimally-encoded display pushed hex value enclosed in angle brackets <4200>
    • Display undecodable bytes using UNDECODABLE(...)

This should permit unambiguous decoding in the future.

@DrahtBot
Copy link
Contributor

DrahtBot commented Nov 8, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Nov 9, 2023
Closes: bitcoin#27795
Closes: bitcoin#7996

Previously ScriptToAsmStr returned hex-encoded integers, except if data
length was <= 4 bytes, in which case it displayed using decimal
encoding.

This was originally added by Satoshi in
4405b78/script.h#L298-L305

Remove the decimal carve-out for small pushes.

Github-Pull: bitcoin#28824
Rebased-From: d6c0c4b
@@ -133,7 +133,7 @@ def decodescript_script_pub_key(self):
cltv_script = '63' + push_public_key + 'ad670320a107b17568' + push_public_key + 'ac'
rpc_result = self.nodes[0].decodescript(cltv_script)
assert_equal('nonstandard', rpc_result['type'])
assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 500000 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm'])
assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 20a107 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems much harder to understand to me -- you need to convert it to 0x07a120 then convert that to decimal to get back to the nice round number block height? In particular mistaking it for 0x20a107 (2,138,375) would be particularly misleading.

@@ -170,7 +170,7 @@ def test_send(self, from_wallet, to_wallet=None, amount=None, data=None,
else:
assert_greater_than(from_balance_before - from_balance, amount)
else:
assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 35"), None)
assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 23"), None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems confusing, particularly if it appears in something like DUP SIZE 20 EQUALVERIFY HASH160 <x> EQUAL -- whoops, that's actually saying the input should be 32 bytes, not 20 bytes.

@MatthewLM
Copy link

MatthewLM commented Nov 9, 2023

In the dart library coinlib (https://pub.dev/packages/coinlib) I followed the same approach of encoding everything in HEX with integers encoded as LE. However -1 is encoded as -1. See the code here: https://github.com/peercoin/coinlib/blob/master/coinlib/lib/src/scripts/operations.dart

I did this to avoid changing the ASM too much. However, I'm happy to move towards using 0x for hex and presenting decimal for smaller data of up-to 32bits if this is introduced in Bitcoin. This would make the ASM easy to understand and read.

@willcl-ark
Copy link
Member Author

@ajtowns @MatthewLM thank for your comments.

I agree that after these changes some ASM fields will be less human-readable-friendly in LE hex.

Prefixing with 0d was another possibility discussed previously, but there still exists a chance that someone will interpret these as hex-encoded values, as d is a valid hex char (whereas x is clearly not decimal or hex). e.g this is not necessarily that much clearer, but does offer some improvement:

$ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"1112131415 0d1112131415"

$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
"1234567890 0d500000 OP_CHECKLOCKTIMEVERIFY"

Thinking about this more, it feels like the question to be answered is perhaps "is the asm field supposed to be human or machine readable?" If the former, then maintaining decimal for small values, and adding a hex-prefix for > 4 byte values might be most desirable, even it it leaves some ambiguity in some circumstances?

$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
"0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY"

This would create a moderately inconvenient interface in a few of the tests (and possibly some downstream users relying on this output) where hex prefixes, if present, have to be stripped from rpc output, but perhaps that's not too terrible.

If we are designing the asm output to be a canonical, machine-readable format then my opinion remains that using hex-only makes the most sense.

FWIW btcdeb interprets all values as decimal but prints a warning that they are ambiguous:

$ btcc 1234567890 500000 OP_CHECKLOCKTIMEVERIFY
warning: ambiguous input 1234567890 is interpreted as a numeric value; use 0x1234567890 to force into hexadecimal interpretation
warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
04d20296490320a107b1

$ btcc 0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY
warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
0512345678900320a107b1

And consequently generates a different script without manually specifying hex. I don't know what other tools do by default.

@MatthewLM
Copy link

I agree that 0d is not a good idea, much better to go with 0x for hex.

@ajtowns
Copy link
Contributor

ajtowns commented Nov 9, 2023

Thinking about this more, it feels like the question to be answered is perhaps "is the asm field supposed to be human or machine readable?"

I think it's human readable? For machines we could just give the raw hex script and not bother with an asm format.

Anyway, how about going for something more left field and having decodescript 0511121314150457c74942 output [1112131415] 1112131415 where the square brackets indicated "quoted hexidecimal", and values without square brackets are always decimal numbers within CScriptNum range? That would still be pretty clear for pubkeys:

2 [027d28a3580d0823dbee34f1df927ee58664eb8349e79b94017ddaf3f7e159153f] [03a3af0f49a21d29106ebeef9b3c3d69fc375c856a7153d97156a5b7a161ca6c25] 2 OP_CHECKMULTISIG

Could perhaps also represent non-minimal pushes that way, eg the different encodings of -1 or [ff], something like:

 decodescript 4f -> -1
 decodescript 0181 -> [0:81]
 decodescript 4c0181 -> [1:81]
 decodescript 4d010081 -> [2:81]
 decodescript 4e0100000081 -> [4:81]

@willcl-ark
Copy link
Member Author

I think it's human readable? For machines we could just give the raw hex script and not bother with an asm format.

This has always been my interpretation too.

Anyway, how about going for something more left field...

Square brackets is something I could get behind; to me it is clear from your example that the brackets are indicating something, and just by looking at them I can see in most cases that they're hex values. To use the previous (less common?) example formatted with the brackets:

$ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"[1112131415] 1112131415"

IMO it's again clear to me now that there's something different about these two values, and if I didn't know what it was then I should find out.

One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable?

$ src/bitcoin-cli -regtest decoderawtransaction 0100000001696a20784a2c70143f634e95227dbdfdf0ecd51647052e70854512235f5986ca010000008a47304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb014104d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536ffffffff0100e1f505000000001976a914eb6c6e0cdb2d256a32d97b8df1fc75d1920d9bca88ac00000000 | jq .vin[].scriptSig.asm
"[304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb[ALL]] [04d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536]"

specifically:

[... 2dfb[ALL]]

It might be preferable to have this format instead?

[... 2dfb][ALL]

@MatthewLM
Copy link

0x is much more standard and recognisable than [].

@ajtowns
Copy link
Contributor

ajtowns commented Nov 13, 2023

0x is much more standard and recognisable than [].

It's pretty misleading though, since CScriptNum is little-endian: if you see 0x0100 you would expect that to be 256, but for us it would actually be non-minimally-encoded 1; For us, 0x0201 would actually be less than 0x0102, etc.

Note this applies to block hashes, as well, so if you provide block 816651's header and execute HASH256, you end up with [7d6368795d9675a198bd1dbe3d87dbdbbc3e0f3560a202000000000000000000]; if you were to write 0x00000000000000000002a260350f3ebcdbdb873dbe1dbd98a175965d7968637d to refer to the block hash, the same way we do in kernel/chainparams.cpp, you'd end up with the wrong thing. But you can't have "0x" just reverse things either, as that would mess up 33-byte pubkeys (they'd now end with 02 or 03 instead of beginning with it) and signatures.

@ajtowns
Copy link
Contributor

ajtowns commented Nov 13, 2023

One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable?

specifically:

[... 2dfb[ALL]]

It might be preferable to have this format instead?

[... 2dfb][ALL]

I guess it being inside the brackets seems slightly clearer to me? Could have different brackets for the two things, eg [... 2dfb<ALL>]. Could perhaps also drop the feature entirely (reverting #5264); note that we don't display sighash flags for taproot signatures.

@MatthewLM
Copy link

It's pretty misleading though, since CScriptNum is little-endian: if you see 0x0100 you would expect that to be 256, but for us it would actually be non-minimally-encoded 1; For us, 0x0201 would actually be less than 0x0102, etc.

Those smaller numbers will be shown in decimal still. For larger pushdata, hex is already shown in the little-endian encoded byte order. If 0x is too suggestive of a big-endian hex number (a fair point), then maybe bytes(123456789abcdef) is clearer and less ambiguous (maybe use push instead of bytes). If following that scheme, signatures could be something like sig(...2dfb, ALL). Possibly public keys could be pk(...).

@sipa
Copy link
Member

sipa commented Nov 14, 2023

Do we actually want to resolve all forms of ambiguity? E.g. what about the distinction between an OP_1 vs. a direct push of 0x01, vs. OP_PUSHDATA1 of 0x01? I think it'd be nice if there was a one-to-one correspondence between the disassembly and the actual script bytes.

My contribution to this bikeshedding would be:

  • OP_n -> n (so -1, 0, 1, ..., 10, ..., 16).
  • Drop "OP_" prefix for other opcodes too; they're redundant.
  • Other minimal pushes (so direct pushes, as well as OP_PUSHDATA1 over 75 bytes, and OP_PUSHDATA2 over 255 bytes):
    • If the data pushed is a minimally-encoded positive number between 0 and 2^39-1 (OP_CLTV / OP_CSV accept 5-byte numbers) -> <0x12345678> (BE hex, as that's what a human expects for a numeric value with 0x).
    • Otherwise, <1234567890ABCDEF> (in wire byte order, no 0x because not attempting to convey a number)
  • For non-minimal pushes, use PUSHDATA1<...> or PUSHDATA2<...> (with the contents of ... using the same rules as for minimal pushes).
  • Drop the sighash byte type decoding. I don't feel it matches what asm is trying to do.
  • Optionally, for P2SH(-like) scriptSigs, the final redeemScript push could be recursively decoded (so inside <...>, put the asm decoding of the script being pushed), while remaining fully unambiguous.

@ajtowns
Copy link
Contributor

ajtowns commented Nov 15, 2023

My contribution to this bikeshedding would be:
* OP_n -> n (so -1, 0, 1, ..., 10, ..., 16).
* Drop "OP_" prefix for other opcodes too; they're redundant.
* Drop the sighash byte type decoding. I don't feel it matches what asm is trying to do.

👍

  • For non-minimal pushes, use PUSHDATA1<...> or PUSHDATA2<...> (with the contents of ... using the same rules as for minimal pushes).

Also PUSHDATA4<...> for completeness, presumably. A direct push of values -1 and 1..16 is also non-minimal (ie, 0101 instead of 51), could perhaps use PUSHDATA<..> (without 1, 2 or 4) for that case.

  • Other minimal pushes (so direct pushes, as well as OP_PUSHDATA1 over 75 bytes, and OP_PUSHDATA2 over 255 bytes):
  • If the data pushed is a minimally-encoded positive number between 0 and 2^39-1 (OP_CLTV / OP_CSV accept 5-byte numbers) -> <0x12345678> (BE hex, as that's what a human expects for a numeric value with 0x).

Provided you've already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (-2 vs <82>), and it's one less encoding variation?

  • Optionally, for P2SH(-like) scriptSigs, the final redeemScript push could be recursively decoded (so inside <...>, put the asm decoding of the script being pushed), while remaining fully unambiguous.

That would be nice. Probably requires remembering info from the scriptPubKey and using it when decoding the scriptSig or witness.

@MatthewLM
Copy link

Provided you've already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (-2 vs <82>), and it's one less encoding variation?

I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I'm not sure.

Having embedded ASM for redeem scripts and Tapscripts would be very useful.

@ajtowns
Copy link
Contributor

ajtowns commented Nov 15, 2023

I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I'm not sure.

What ambiguity?

@MatthewLM
Copy link

MatthewLM commented Nov 15, 2023

For 32-bit signed integers, the 32nd leftmost bit is negative, but that's not true for CLTV and CSV which uses 5 bytes. So a push data could represent a negative 32-bit signed integer, or a positive 40-bit signed integer if I understand the CScriptNum without looking more closely at it.

@sipa
Copy link
Member

sipa commented Nov 15, 2023

@MatthewLM I don't think that's correct. What's special in CSV and CLTV is that they permit numbers whose encoding is up to 5 bytes (rather than the math opcodes which only support encodings up to 4 bytes on input). But the encoding algorithm is the same: the top bit is the sign.

@MatthewLM
Copy link

MatthewLM commented Nov 15, 2023

@sipa I'll have to look at the code. I imagine it is two's complement? Then the negative bit will be different for each type of integer meaning that a 4 byte pushdata that would encode a negative 32-bit signed integer could also be interpreted as a positive 40-bit signed integer.

@sipa
Copy link
Member

sipa commented Nov 15, 2023

@ajtowns Ok, iterating on that, new proposal:

  • Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).
  • OP_ prefix dropped in general
  • Drop sighash type.
  • Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with ... the hex-encoding (in wire byte order) of the push.
  • Non-minimal pushes become PUSHDATA1<...>, PUSHDATA2<...>, or PUSHDATA4<...>.
  • Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

There is no need for PUSHDATA<...> here, because a direct push of 3 for example (0103 in hex) would become <03>, while OP_3 would become 3.

@MatthewLM I think you're missing something, but I don't see what. There is no ambiguity. Every byte encoding represents exactly one number, with a well-defined sign.

@MatthewLM
Copy link

@sipa Sorry, you are right. I looked at the code again. Bitcoin doesn't use two's complement for this (I have to keep in mind that Bitcoin has a lot of quirks). It uses a sign bit that negates the other bits, and this sign bit depends on the size of the push data (whatever the last byte is), so there is no ambiguity.

Therefore I think minimally pushed numbers ought to be displayed as decimal for readability reasons as you now suggest.

  • Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

Or any other decimal that has an even number of digits. Maybe use something like SCRIPT<...>?

@sipa
Copy link
Member

sipa commented Nov 15, 2023

@MatthewLM Right, any single numeric push of even number of digits inside <> would be ambiguous. But also, scripts consisting of a single push are generally useless as a redeemscript (they'd be anyone-can-spend), so I'm not sure it's an issue. If you actually encounter one of those, just don't do recursive decoding.

@sipa
Copy link
Member

sipa commented Nov 15, 2023

One more thought: if the format we end up with is unambiguous, there could also be a function/RPC/tool to convert asm back to script (and we can have fuzz tests that the roundtripping encoding+decoding is exact!). If so, we can probably permit a few things in the decoding that the encoder might not (yet) support:

  • Optionally OP_ prefixing opcodes (and using aliases like TRUE or OP_TRUE for OP_1).
  • Using 0x... for BE hex encoded numbers in all places where a decimal number is supported (so outside <>).
  • PUSHDATA1<...> or PUSHDATA2<...> notation even when the push is minimal.
  • Recursive decoding (if not supported in the encoder in the first iteration).

@willcl-ark
Copy link
Member Author

I like where this is going now. Thanks for the conversation.

I've written a branch implementing this previous version as I wanted to see what the changes would look like (although I'm not convinced I'm handling OP_1NEGATE properly yet, even though tests are passing...)

Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).

I think this will make the representation even nicer to interpret and will gladly make this change.

Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

I'll take a stab at implementing these nested scripts as well, as I think they'd be a nice improvement too...

@ajtowns
Copy link
Contributor

ajtowns commented Nov 15, 2023

  • Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).

-2**31 -2**31+1 and 2**39-1 presumably? (original lacks the negative sign, and github can't quote superscript correctly apparently). EDIT: 0xffff_ffff is -2**31+1 with a sign-flag, rather than -2**31 being 0x8000_0000 in two's complement

  • Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with ... the hex-encoding (in wire byte order) of the push.

Per script/script.cpp:CheckMinimalPush(), 0101 isn't a minimal push, so I think it'd be better to drop the "minimal push" phrasing and just use the definition currently in brackets. (also; PUSHDATA4 > 65535 bytes)

  • Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

Any push of a single positive number between 10-99; 1,000-9,999; 100,000-999,999; 10,000,000-99,999,999; 1,000,000,000-9,999,999,999; and 100,000,000,000-549,755,813,887 would be ambiguous, no?

@sipa
Copy link
Member

sipa commented Nov 15, 2023

@ajtowns Right on all 3 counts. Instead of limiting the integer range, maybe just "all minimal pushes of up to 5 bytes, which push a minimally-encoded integer".

@sipa
Copy link
Member

sipa commented Nov 15, 2023

One more question: what to do with unknown opcodes, and undecodable bytes? The current FormatScript just turns it into bare hex with 0x prefix. I guess we could keep the "0x" prefix to denote "raw stuff", but the current format is pretty misleading (it uses wire byte order, while "0x" suggests they're BE numbers) and incompatible with allowing "0x..." for denoting actual integers.

Suggestion: use RAW(...) with "..." the hex in wire byte order for those. I'd prefer not to use RAW<...> because I associate <> with "pushing", which wouldn't be correct here. I'm not too happy with having a special keyword in the syntax just for that either, so suggestions welcome.

@willcl-ark
Copy link
Member Author

Currently GetOpCode() will return OP_UNKNOWN for unknown opcodes which seems reasonable. Dropping the OP_ would appear in the asm as UNKNOWN.

As for undecodable bytes, if we did want a keyword, perhaps something more explicit like UNDECODABLE<...> would be better? We are essentially using what I would expect RAW to denote for > 5 byte minimal pushes if we go with this:

Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with ... the hex-encoding (in wire byte order) of the push.

@sipa
Copy link
Member

sipa commented Nov 15, 2023

Currently GetOpCode() will return OP_UNKNOWN for unknown opcodes which seems reasonable.

That's trivially ambiguous, as all unknown opcodes are mapped to the same thing.

As for undecodable bytes, if we did want a keyword, perhaps something more explicit like UNDECODABLE<...> would be better?

Maybe, but I wouldn't use <...> because it's not a push.

We are essentially using what I would expect RAW to denote for > 5 byte minimal pushes if we go with this:

No, any pushes become either decimal numbers, or <...>, or PUSHDATA[124]<...>. Pushes are not undecodable.

EDIT: actually, I think all undecodable things (ignoring unknown opcodes) are pushes (but ones which straddle the end of the script boundary).

@willcl-ark willcl-ark marked this pull request as draft November 15, 2023 20:02
@willcl-ark willcl-ark force-pushed the asm-full-hex branch 2 times, most recently from 016803c to 89547bb Compare November 15, 2023 22:31
@willcl-ark
Copy link
Member Author

Marked as draft for now and pushed what I have worked on so far.

I haven't looked at implementing undecodable bytes or nested scripts yet.

if (CheckMinimalPush(vch, opcode)) {
if (vch.size() <= 5 ) {
// Return decimal for minimially-encoded values <= 5 bytes (OP_CLTV / OP_CSV accept 5-byte numbers)
CScriptNum n(vch, false, 5);
Copy link
Member

@sipa sipa Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want fRequireMinimal=true true here, otherwise it'll turn non-minimally-encoded values into decimal too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it should be true as a belt-and-suspenders, but we are inside if (CheckMinimalPush(vch, opcode)) already here, so shouldn't happen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte code 024000 is a minimal push of the two-byte vector 4000 (so passes CheckMinimalPush) but it's not a minimal CScriptNum since it just evaluates to 64, and should have been 0140 (so fails fRequireMinimal due to the trailing 0).

Copy link
Member

@sipa sipa Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there are two distinct concepts of minimality:

  • Minimal pushes apply to the interpretation of stack elements as byte vectors, and are about whether you used the right opcode (OP_n, direct push, OP_PUSHDATA1, OP_PUSHDATA2, OP_PUSHDATA4): you have to use the first applicable one from that list for a push to be a minimal push. It's relevant even for non-numeric data in script (e.g. you shouldn't use OP_PUSHDATA1 to push a public key, as a direct push suffices).
  • Minimally-encoded integers apply to the interpretation of stack elements as numbers, and are about whether no surplus bytes were introduced when encoding the number as a byte vector but not about how that byte vector gets pushed. The byte vectors 42, 4200, 420000 all encode the number 66, and 83, 0380, 030080, ... all encode the number -3, but you have to use the first one from each list for it to be minimally-encoded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations.

I've refactored out handling pushed data, as I wasn't enjoying all the nested if statements, and notice that we are currently handling "non-minimal pushes" the same as minimal pushes with non-minimally-encoded values, which is to use OPCODE<hex>. This seems ok, and it will not be a lossy decode, but do we want to differentiate between the two pathways to this decode format?

See first and last case in snippet to illustrate what I mean above:

/*
 * Format pushed bytes into appropriate ASM format
 */
std::string FormatPushDataAsm(const std::vector<unsigned char>& vch, opcodetype opcode)
{
    // Use OPCODE<hex> for non-minimal pushes
    // TODO: There are no tests for this currently
    if (!CheckMinimalPush(vch, opcode)) {
        return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
    }

    // Use <hex> for minimal pushes > 5 bytes
    if (!(vch.size() <= 5) ) {
        return BracketStr(HexStr(vch));
    }

    // Use decimal for minimally-encoded, minimal pushes <= 5 bytes
    // Note: OP_CLTV / OP_CSV accept 5-byte numbers
    try {
        CScriptNum n{vch, true, 5};
        return strprintf("%lld", n.GetInt64());
    }

    // Use OPCODE<hex> for non-minimally-encoded minimal pushes
    // TODO: There are no tests for this currently
    catch (scriptnum_error& e) {
        return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but what I mean is that in my current impl there are two ways we can end up with OPCODE<...>:

  1. Wasn't a minimal push
  2. Was a minimal push <5 bytes, but wasn't a minimally-encoded decimal value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, right. I think that's redundant.

I'd go for something like:

  • If minimal push, not over 5 bytes, and minimally encoded, use decimal.
  • Otherwise, if minimal push, or direct push, use <...>.
  • Otherwise, use OPCODE<...>.

Copy link
Member

@sipa sipa Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unambiguous, with the following decoding rules:

  • Decimal numbers are turned into OP_n if applicable, otherwise into a direct push of the minimally-encoded form of that number.
  • <...> is turned into a direct push if up to 75 bytes, into OP_PUSHDATA1 if below 256 bytes, into OP_PUSHDATA2 if below 65536 bytes, and into OP_PUSHDATA4 otherwise.
  • OPCODE<...> is turned into a push using the relevant opcode.

In particular (I think @ajtowns pointed this out before as well), using <...> for non-minimal but direct pushes is fine, as it cannot represent anything else (not an OP_n, because those use decimal, and not a PUSHDATA opcode because if those were used for sizes <= 75 bytes, they'd have used OPCODE<...>).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This just means changing the final catch to return BracketStr(HexStr(vch));, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also,

if (!(vch.size() <= 5) ) {

Can I introduce you to > ? 😄

} else {
str += HexStr(vch);
// Otherwise display the push as LE hex enclosed in angle brackets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this to be a controversial opinion, but I disagree with calling this "little endian". That's a term that applies to encoding numbers to bytes/bits. Since what's being encoded isn't a number, or to be interpreted as a number, endianness is inapplicable. It's just encoding bytes in hexadecimal format, in order.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not little endian, I agree.

@sipa
Copy link
Member

sipa commented Nov 17, 2023

Brainstormy idea for undecodable scripts.

These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using <...+N> or OPCODE<...+N>, where ... is hexadecimal as before, and N is a decimal number indicating how many bytes are missing?

EDIT: for OP_PUSHDATA[124] it's also possible that the 1-4 byte length field itself straddles the script end, and that isn't covered by this suggestion.

@willcl-ark
Copy link
Member Author

Brainstormy idea for undecodable scripts.

Currently failure to fully parse in GetScriptOp will only return false if we run out of bytes to read either the length, or from the operand, and we don't store the opcode or the available remaining bytes from the iterator in opcodeRet or pvchRet. But we could change GetScriptOpt to store them the failure case too.

<...+N> or OPCODE<...+N>

I don't think it makes much difference which flavour is chosen here; both would require the N to be parsed to calculate the total length and therefore determine the opcode that was used before it was encoded. As there are bytes missing, I don't think we can tell if this was a minimal push or not either. I'd probably slightly lean towards using OPCODE<...+N> as it's slightly more human-friendly?

@sipa
Copy link
Member

sipa commented Nov 17, 2023

I'd probably slightly lean towards using OPCODE<...+N> as it's slightly more human-friendly?

In case it's a direct push there is no opcode at all. I didn't mean this as a flavor to choose from; we need both.

@ajtowns
Copy link
Contributor

ajtowns commented Nov 22, 2023

Brainstormy idea for undecodable scripts.

These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using <...+N> or OPCODE<...+N>, where ... is hexadecimal as before, and N is a decimal number indicating how many bytes are missing?

EDIT: for OP_PUSHDATA[124] it's also possible that the 1-4 byte length field itself straddles the script end, and that isn't covered by this suggestion.

If you're adding extra encoding that need to be parsed (+N), maybe just have UNPARSABLE<hex> or GARBAGE<hex> to capture the trailing data that wasn't a full push? Seems less likely to be confusing (easy to miss a +30 at the end of a hex string), and works even if we change an OP_SUCCESS into a multibyte opcode in future?

@sipa
Copy link
Member

sipa commented Nov 22, 2023

Yeah, my only (rather minor) reservation about UNPARSEABLE<...> is that I think of <...> as a way of indicating "stuff being pushed", which wouldn't be the case here. And if we're going to introduce some special syntax for unparseable stuff anyway, I thought it might be useful to integrate into the push syntax itself (as everything unparseable is an incomplete push).

That said, I agree that the +N thing isn't super readabable, and it's incomplete anyway (can't deal with OP_PUSHDATA[124] whose length descriptor itself is missing).

So maybe just UNPARSEABLE(...)?

@ajtowns
Copy link
Contributor

ajtowns commented Nov 22, 2023

Yeah, my only (rather minor) reservation about UNPARSEABLE<...> is that I think of <...> as a way of indicating "stuff being pushed",

Yeah, I guess being able to scan and say <...> means data is going onto the stack makes sense.

So maybe just UNPARSEABLE(...)?

Could do UNPARSEABLE:04010203 or something (colon instead of brackets).

(Arguably, everything after the first OP_SUCCESS in a tapscript could also be considered trailing hex that shouldn't be decoded. Probably more useful to decode it anyway, though)

@maflcko
Copy link
Member

maflcko commented Mar 11, 2024

Are you still working on this?

We don't decode these on taproot signatures already, modify non-taproot
signatures to match.
Strip unneccesary leading "OP_" prefix from opcodes when they are
rendered as ASM.
@willcl-ark willcl-ark changed the title Use LE hex-encoded representations in script ASM for pushed values <= 4 bytes Fix ASM ambiguity Mar 12, 2024
A simple helper function to angle bracket a hex string.
* Return minimal pushes < 5 bytes as decimal
* Return non-minimal pushes < 5 bytes wrapped as OP_CODE<...>
* Return pushes > 5 bytes as hex wrapped as <...>
@sipa
Copy link
Member

sipa commented May 8, 2024

Attempt to revive the discussion: #27795 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants