Fix ASM ambiguity #28824

willcl-ark · 2023-11-08T16:07:36Z

Closes: #27795
Closes: #7996

Previously ScriptToAsmStr returned hex-encoded values, except if data length was <= 4 bytes, in which case it displayed using decimal encoding.

Fix the ASM representation by specifying an unambiguous encoding:

- Drop the OP_ prefix from all opcodes
- For non-minimal pushes, prefix with the opcode and enclose the pushed hex value in angle brackets: PUSHDATA1<0100>
- For minimal pushes:
  - If > 5 bytes, display the pushed hex value enclosed in angle brackets: <...>
  - If <= 5 bytes:
    - If minimally-encoded, display the pushed value in decimal without angle brackets: 42
    - If not minimally-encoded, display the pushed hex value enclosed in angle brackets: <4200>
- Display undecodable bytes using UNDECODABLE(...)

This should permit unambiguous decoding in the future.

DrahtBot · 2023-11-08T16:07:38Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Closes: bitcoin#27795
Closes: bitcoin#7996

Previously ScriptToAsmStr returned hex-encoded integers, except if data length was <= 4 bytes, in which case it displayed using decimal encoding. This was originally added by Satoshi in 4405b78/script.h#L298-L305

Remove the decimal carve-out for small pushes.

Github-Pull: bitcoin#28824
Rebased-From: d6c0c4b

ajtowns · 2023-11-09T04:12:28Z

test/functional/rpc_decodescript.py

@@ -133,7 +133,7 @@ def decodescript_script_pub_key(self):
        cltv_script = '63' + push_public_key + 'ad670320a107b17568' + push_public_key + 'ac'
        rpc_result = self.nodes[0].decodescript(cltv_script)
        assert_equal('nonstandard', rpc_result['type'])
-        assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 500000 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm'])
+        assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 20a107 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm'])


This seems much harder to understand to me -- you need to convert it to 0x07a120 then convert that to decimal to get back to the nice round number block height? In particular mistaking it for 0x20a107 (2,138,375) would be particularly misleading.
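A small Python check (not part of the PR) makes the two readings of 20a107 concrete. It only uses the standard `int.from_bytes`; the variable names are illustrative.

```python
# The three pushed bytes from the CLTV script, in the order they appear on the wire.
pushed = bytes.fromhex("20a107")

# Script numbers are little-endian, so this is the intended block height.
as_script_number = int.from_bytes(pushed, "little")

# Reading the same digits as an ordinary big-endian hex literal gives a different value.
as_naive_hex = int.from_bytes(pushed, "big")

print(as_script_number)  # 500000
print(as_naive_hex)      # 2138375
```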

ajtowns · 2023-11-09T04:14:19Z

test/functional/wallet_send.py

@@ -170,7 +170,7 @@ def test_send(self, from_wallet, to_wallet=None, amount=None, data=None,
                else:
                    assert_greater_than(from_balance_before - from_balance, amount)
            else:
-                assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 35"), None)
+                assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 23"), None)


This also seems confusing, particularly if it appears in something like DUP SIZE 20 EQUALVERIFY HASH160 <x> EQUAL -- whoops, that's actually saying the input should be 32 bytes, not 20 bytes.

MatthewLM · 2023-11-09T09:32:39Z

In the Dart library coinlib (https://pub.dev/packages/coinlib) I followed the same approach of encoding everything in hex, with integers encoded as LE. However, -1 is encoded as -1. See the code here: https://github.com/peercoin/coinlib/blob/master/coinlib/lib/src/scripts/operations.dart

I did this to avoid changing the ASM too much. However, I'm happy to move towards using 0x for hex and presenting decimal for smaller data of up to 32 bits if this is introduced in Bitcoin. This would make the ASM easy to understand and read.

willcl-ark · 2023-11-09T16:25:19Z

@ajtowns @MatthewLM thanks for your comments.

I agree that after these changes some ASM fields will be less human-readable-friendly in LE hex.

Prefixing with 0d was another possibility discussed previously, but there is still a chance that someone will interpret these as hex-encoded values, as d is a valid hex char (whereas x is clearly not decimal or hex). For example, this is not necessarily that much clearer, but does offer some improvement:

$ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"1112131415 0d1112131415"

$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
"1234567890 0d500000 OP_CHECKLOCKTIMEVERIFY"

Thinking about this more, it feels like the question to be answered is perhaps "is the asm field supposed to be human or machine readable?" If the former, then maintaining decimal for small values, and adding a hex-prefix for > 4 byte values might be most desirable, even if it leaves some ambiguity in some circumstances?

$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
"0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY"

This would create a moderately inconvenient interface in a few of the tests (and possibly some downstream users relying on this output) where hex prefixes, if present, have to be stripped from rpc output, but perhaps that's not too terrible.

If we are designing the asm output to be a canonical, machine-readable format then my opinion remains that using hex-only makes the most sense.

FWIW btcdeb interprets all values as decimal but prints a warning that they are ambiguous:

$ btcc 1234567890 500000 OP_CHECKLOCKTIMEVERIFY
warning: ambiguous input 1234567890 is interpreted as a numeric value; use 0x1234567890 to force into hexadecimal interpretation
warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
04d20296490320a107b1

$ btcc 0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY
warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
0512345678900320a107b1

And consequently generates a different script without manually specifying hex. I don't know what other tools do by default.
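The two btcc outputs above can be reproduced by hand. The following Python sketch is only an illustration: `direct_push` is a hypothetical helper, and the byte widths are chosen manually (a full encoder would also need to handle the sign bit).

```python
OP_CHECKLOCKTIMEVERIFY = "b1"

def direct_push(data_hex: str) -> str:
    """Prefix hex data with its one-byte length (valid for pushes of 1..75 bytes)."""
    return f"{len(data_hex) // 2:02x}{data_hex}"

# "1234567890" read as a decimal number: four little-endian bytes.
as_decimal = (1234567890).to_bytes(4, "little").hex()
# "1234567890" read as hex: the five bytes 12 34 56 78 90, kept in order.
as_hex = "1234567890"
# 500000 is read as decimal in both invocations: three little-endian bytes.
height = (500000).to_bytes(3, "little").hex()

script_from_decimal = direct_push(as_decimal) + direct_push(height) + OP_CHECKLOCKTIMEVERIFY
script_from_hex = direct_push(as_hex) + direct_push(height) + OP_CHECKLOCKTIMEVERIFY

print(script_from_decimal)  # 04d20296490320a107b1
print(script_from_hex)      # 0512345678900320a107b1
```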

MatthewLM · 2023-11-09T16:31:06Z

I agree that 0d is not a good idea, much better to go with 0x for hex.

ajtowns · 2023-11-09T20:39:41Z

Thinking about this more, it feels like the question to be answered is perhaps "is the asm field supposed to be human or machine readable?"

I think it's human readable? For machines we could just give the raw hex script and not bother with an asm format.

Anyway, how about going for something more left field and having decodescript 0511121314150457c74942 output [1112131415] 1112131415 where the square brackets indicate "quoted hexadecimal", and values without square brackets are always decimal numbers within CScriptNum range? That would still be pretty clear for pubkeys:

2 [027d28a3580d0823dbee34f1df927ee58664eb8349e79b94017ddaf3f7e159153f] [03a3af0f49a21d29106ebeef9b3c3d69fc375c856a7153d97156a5b7a161ca6c25] 2 OP_CHECKMULTISIG

Could perhaps also represent non-minimal pushes that way, eg the different encodings of -1 or [ff], something like:

 decodescript 4f -> -1
 decodescript 0181 -> [0:81]
 decodescript 4c0181 -> [1:81]
 decodescript 4d010081 -> [2:81]
 decodescript 4e0100000081 -> [4:81]
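For reference, the five raw scripts in the list above can be built from the same one-byte payload. This Python sketch is illustrative only; the dictionary keys are labels chosen here, not names from the PR.

```python
payload = bytes.fromhex("81")  # the single byte that script numbers read as -1

encodings = {
    "OP_1NEGATE": bytes([0x4f]),
    "direct push": bytes([len(payload)]) + payload,
    "OP_PUSHDATA1": bytes([0x4c]) + len(payload).to_bytes(1, "little") + payload,
    "OP_PUSHDATA2": bytes([0x4d]) + len(payload).to_bytes(2, "little") + payload,
    "OP_PUSHDATA4": bytes([0x4e]) + len(payload).to_bytes(4, "little") + payload,
}

for name, raw in encodings.items():
    print(f"{name:12} {raw.hex()}")
```

Each entry leaves the same value on the stack, which is why a one-to-one disassembly needs a distinct spelling for each.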

willcl-ark · 2023-11-13T12:57:44Z

I think it's human readable? For machines we could just give the raw hex script and not bother with an asm format.

This has always been my interpretation too.

Anyway, how about going for something more left field...

Square brackets is something I could get behind; to me it is clear from your example that the brackets are indicating something, and just by looking at them I can see in most cases that they're hex values. To use the previous (less common?) example formatted with the brackets:

$ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"[1112131415] 1112131415"

IMO it's again clear to me now that there's something different about these two values, and if I didn't know what it was then I should find out.

One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable?

$ src/bitcoin-cli -regtest decoderawtransaction 0100000001696a20784a2c70143f634e95227dbdfdf0ecd51647052e70854512235f5986ca010000008a47304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb014104d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536ffffffff0100e1f505000000001976a914eb6c6e0cdb2d256a32d97b8df1fc75d1920d9bca88ac00000000 | jq .vin[].scriptSig.asm
"[304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb[ALL]] [04d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536]"

specifically:

[... 2dfb[ALL]]

It might be preferable to have this format instead?

[... 2dfb][ALL]

MatthewLM · 2023-11-13T13:37:26Z

0x is much more standard and recognisable than [].

ajtowns · 2023-11-13T23:09:48Z

0x is much more standard and recognisable than [].

It's pretty misleading though, since CScriptNum is little-endian: if you see 0x0100 you would expect that to be 256, but for us it would actually be non-minimally-encoded 1; For us, 0x0201 would actually be less than 0x0102, etc.

Note this applies to block hashes, as well, so if you provide block 816651's header and execute HASH256, you end up with [7d6368795d9675a198bd1dbe3d87dbdbbc3e0f3560a202000000000000000000]; if you were to write 0x00000000000000000002a260350f3ebcdbdb873dbe1dbd98a175965d7968637d to refer to the block hash, the same way we do in kernel/chainparams.cpp, you'd end up with the wrong thing. But you can't have "0x" just reverse things either, as that would mess up 33-byte pubkeys (they'd now end with 02 or 03 instead of beginning with it) and signatures.
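Both byte-order points can be checked with a few lines of Python. This is a sketch using only `int.from_bytes` and slicing; the variable names are made up for the example.

```python
# 0x0100 as stack bytes 01 00 is a padded (non-minimal) 1, not 256.
padded_one = int.from_bytes(bytes.fromhex("0100"), "little")

# Byte strings do not sort like the hex literals they resemble.
n_0201 = int.from_bytes(bytes.fromhex("0201"), "little")  # bytes 02 01
n_0102 = int.from_bytes(bytes.fromhex("0102"), "little")  # bytes 01 02

# HASH256 output for the block header, in the order the bytes sit on the stack.
wire = "7d6368795d9675a198bd1dbe3d87dbdbbc3e0f3560a202000000000000000000"
# The conventional display form of a block hash is the same bytes reversed.
display = bytes.fromhex(wire)[::-1].hex()

print(padded_one, n_0201, n_0102)
print(display)
```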

ajtowns · 2023-11-13T23:34:49Z

One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable?

specifically:
[... 2dfb[ALL]]
It might be preferable to have this format instead?
[... 2dfb][ALL]

I guess it being inside the brackets seems slightly clearer to me? Could have different brackets for the two things, eg [... 2dfb<ALL>]. Could perhaps also drop the feature entirely (reverting #5264); note that we don't display sighash flags for taproot signatures.

MatthewLM · 2023-11-14T11:05:52Z

It's pretty misleading though, since CScriptNum is little-endian: if you see 0x0100 you would expect that to be 256, but for us it would actually be non-minimally-encoded 1; For us, 0x0201 would actually be less than 0x0102, etc.

Those smaller numbers will be shown in decimal still. For larger pushdata, hex is already shown in the little-endian encoded byte order. If 0x is too suggestive of a big-endian hex number (a fair point), then maybe bytes(123456789abcdef) is clearer and less ambiguous (maybe use push instead of bytes). If following that scheme, signatures could be something like sig(...2dfb, ALL). Possibly public keys could be pk(...).

sipa · 2023-11-14T13:28:07Z

Do we actually want to resolve all forms of ambiguity? E.g. what about the distinction between an OP_1 vs. a direct push of 0x01, vs. OP_PUSHDATA1 of 0x01? I think it'd be nice if there was a one-to-one correspondence between the disassembly and the actual script bytes.

My contribution to this bikeshedding would be:

- OP_n -> n (so -1, 0, 1, ..., 10, ..., 16).
- Drop "OP_" prefix for other opcodes too; they're redundant.
- Other minimal pushes (so direct pushes, as well as OP_PUSHDATA1 over 75 bytes, and OP_PUSHDATA2 over 255 bytes):
  - If the data pushed is a minimally-encoded positive number between 0 and 2^39-1 (OP_CLTV / OP_CSV accept 5-byte numbers) -> <0x12345678> (BE hex, as that's what a human expects for a numeric value with 0x).
  - Otherwise, <1234567890ABCDEF> (in wire byte order, no 0x because not attempting to convey a number)
- For non-minimal pushes, use PUSHDATA1<...> or PUSHDATA2<...> (with the contents of ... using the same rules as for minimal pushes).
- Drop the sighash byte type decoding. I don't feel it matches what asm is trying to do.
- Optionally, for P2SH(-like) scriptSigs, the final redeemScript push could be recursively decoded (so inside <...>, put the asm decoding of the script being pushed), while remaining fully unambiguous.

ajtowns · 2023-11-15T03:50:57Z

My contribution to this bikeshedding would be:
* OP_n -> n (so -1, 0, 1, ..., 10, ..., 16).
* Drop "OP_" prefix for other opcodes too; they're redundant.
* Drop the sighash byte type decoding. I don't feel it matches what asm is trying to do.

👍

For non-minimal pushes, use PUSHDATA1<...> or PUSHDATA2<...> (with the contents of ... using the same rules as for minimal pushes).

Also PUSHDATA4<...> for completeness, presumably. A direct push of values -1 and 1..16 is also non-minimal (ie, 0101 instead of 51), could perhaps use PUSHDATA<..> (without 1, 2 or 4) for that case.

Other minimal pushes (so direct pushes, as well as OP_PUSHDATA1 over 75 bytes, and OP_PUSHDATA2 over 255 bytes):

If the data pushed is a minimally-encoded positive number between 0 and 2^39-1 (OP_CLTV / OP_CSV accept 5-byte numbers) -> <0x12345678> (BE hex, as that's what a human expects for a numeric value with 0x).

Provided you've already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (-2 vs <82>), and it's one less encoding variation?

Optionally, for P2SH(-like) scriptSigs, the final redeemScript push could be recursively decoded (so inside <...>, put the asm decoding of the script being pushed), while remaining fully unambiguous.

That would be nice. Probably requires remembering info from the scriptPubKey and using it when decoding the scriptSig or witness.

MatthewLM · 2023-11-15T09:29:32Z

Provided you've already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (-2 vs <82>), and it's one less encoding variation?

I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I'm not sure.

Having embedded ASM for redeem scripts and Tapscripts would be very useful.

ajtowns · 2023-11-15T10:35:55Z

I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I'm not sure.

What ambiguity?

MatthewLM · 2023-11-15T12:31:16Z

For 32-bit signed integers, the 32nd leftmost bit is negative, but that's not true for CLTV and CSV which uses 5 bytes. So a push data could represent a negative 32-bit signed integer, or a positive 40-bit signed integer if I understand the CScriptNum without looking more closely at it.

sipa · 2023-11-15T12:33:39Z

@MatthewLM I don't think that's correct. What's special in CSV and CLTV is that they permit numbers whose encoding is up to 5 bytes (rather than the math opcodes which only support encodings up to 4 bytes on input). But the encoding algorithm is the same: the top bit is the sign.

MatthewLM · 2023-11-15T12:36:48Z

@sipa I'll have to look at the code. I imagine it is two's complement? Then the negative bit will be different for each type of integer meaning that a 4 byte pushdata that would encode a negative 32-bit signed integer could also be interpreted as a positive 40-bit signed integer.

sipa · 2023-11-15T12:47:55Z

@ajtowns Ok, iterating on that, new proposal:

- Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 2³¹ and 2³⁹-1, is just encoded in decimal directly (no <> or anything).
- OP_ prefix dropped in general
- Drop sighash type.
- Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with ... the hex-encoding (in wire byte order) of the push.
- Non-minimal pushes become PUSHDATA1<...>, PUSHDATA2<...>, or PUSHDATA4<...>.
- Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

There is no need for PUSHDATA<...> here, because a direct push of 3 for example (0103 in hex) would become <03>, while OP_3 would become 3.

@MatthewLM I think you're missing something, but I don't see what. There is no ambiguity. Every byte encoding represents exactly one number, with a well-defined sign.

MatthewLM · 2023-11-15T12:56:24Z

@sipa Sorry, you are right. I looked at the code again. Bitcoin doesn't use two's complement for this (I have to keep in mind that Bitcoin has a lot of quirks). It uses a sign bit that negates the other bits, and this sign bit depends on the size of the push data (whatever the last byte is), so there is no ambiguity.

Therefore I think minimally pushed numbers ought to be displayed as decimal for readability reasons as you now suggest.
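The sign-and-magnitude rule described above can be sketched in Python as follows. `decode_scriptnum` is a hypothetical name for this illustration and does not enforce minimal encoding or a size limit.

```python
def decode_scriptnum(data: bytes) -> int:
    """Decode a script number: little-endian magnitude, sign in the top bit of the last byte."""
    if not data:
        return 0
    magnitude = int.from_bytes(data, "little")
    # The sign bit is the highest bit of the final byte, wherever that byte falls.
    sign_mask = 0x80 << (8 * (len(data) - 1))
    if magnitude & sign_mask:
        return -(magnitude & ~sign_mask)
    return magnitude

print(decode_scriptnum(bytes.fromhex("81")))          # -1
print(decode_scriptnum(bytes.fromhex("ffffffff")))    # -2147483647
print(decode_scriptnum(bytes.fromhex("ffffffff00")))  # 4294967295
```

The last two lines show why there is no ambiguity: the four-byte and five-byte vectors are different byte strings, and each has exactly one value.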

Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

Or any other decimal that has an even number of digits. Maybe use something like SCRIPT<...>?

sipa · 2023-11-15T13:01:46Z

@MatthewLM Right, any single numeric push of even number of digits inside <> would be ambiguous. But also, scripts consisting of a single push are generally useless as a redeemscript (they'd be anyone-can-spend), so I'm not sure it's an issue. If you actually encounter one of those, just don't do recursive decoding.

sipa · 2023-11-15T13:24:12Z

One more thought: if the format we end up with is unambiguous, there could also be a function/RPC/tool to convert asm back to script (and we can have fuzz tests that the roundtripping encoding+decoding is exact!). If so, we can probably permit a few things in the decoding that the encoder might not (yet) support:

- Optionally OP_ prefixing opcodes (and using aliases like TRUE or OP_TRUE for OP_1).
- Using 0x... for BE hex encoded numbers in all places where a decimal number is supported (so outside <>).
- PUSHDATA1<...> or PUSHDATA2<...> notation even when the push is minimal.
- Recursive decoding (if not supported in the encoder in the first iteration).

willcl-ark · 2023-11-15T13:25:39Z

I like where this is going now. Thanks for the conversation.

I've written a branch implementing this previous version as I wanted to see what the changes would look like (although I'm not convinced I'm handling OP_1NEGATE properly yet, even though tests are passing...)

Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 2³¹ and 2³⁹-1, is just encoded in decimal directly (no <> or anything).

I think this will make the representation even nicer to interpret and will gladly make this change.

Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

I'll take a stab at implementing these nested scripts as well, as I think they'd be a nice improvement too...

ajtowns · 2023-11-15T13:49:41Z

Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).

~~-2**31~~ -2**31+1 and 2**39-1 presumably? (original lacks the negative sign, and github can't quote superscript correctly apparently). EDIT: 0xffff_ffff is -2**31+1 with a sign-flag, rather than -2**31 being 0x8000_0000 in two's complement

Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with ... the hex-encoding (in wire byte order) of the push.

Per script/script.cpp:CheckMinimalPush(), 0101 isn't a minimal push, so I think it'd be better to drop the "minimal push" phrasing and just use the definition currently in brackets. (also; PUSHDATA4 > 65535 bytes)

Inside <...> instead of having hex data, it's also allowed to put another decoded script (as long the script doesn't consist of exactly one 2-digit number push, because that'd be ambiguous...).

Any push of a single positive number between 10-99; 1,000-9,999; 100,000-999,999; 10,000,000-99,999,999; 1,000,000,000-9,999,999,999; and 100,000,000,000-549,755,813,887 would be ambiguous, no?
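The ranges listed above are exactly the positive numbers with an even count of decimal digits, up to the 5-byte maximum. A minimal Python sketch of that test, with a hypothetical function name:

```python
MAX_5_BYTE = 2**39 - 1  # 549,755,813,887

def looks_like_hex_push(n: int) -> bool:
    """True if the decimal rendering of n could also be read as a hex byte string."""
    text = str(n)
    # An all-digit string with an even length is also valid hex for len/2 bytes.
    return 0 < n <= MAX_5_BYTE and len(text) % 2 == 0

for n in (7, 42, 100, 500000, MAX_5_BYTE):
    print(n, looks_like_hex_push(n))
```

So a nested script consisting only of the number 42 would render as <42>, which is indistinguishable from a push of the single byte 0x42.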

sipa · 2023-11-15T14:00:01Z

@ajtowns Right on all 3 counts. Instead of limiting the integer range, maybe just "all minimal pushes of up to 5 bytes, which push a minimally-encoded integer".

sipa · 2023-11-15T14:05:45Z

One more question: what to do with unknown opcodes, and undecodable bytes? The current FormatScript just turns it into bare hex with 0x prefix. I guess we could keep the "0x" prefix to denote "raw stuff", but the current format is pretty misleading (it uses wire byte order, while "0x" suggests they're BE numbers) and incompatible with allowing "0x..." for denoting actual integers.

Suggestion: use RAW(...) with "..." the hex in wire byte order for those. I'd prefer not to use RAW<...> because I associate <> with "pushing", which wouldn't be correct here. I'm not too happy with having a special keyword in the syntax just for that either, so suggestions welcome.

willcl-ark · 2023-11-15T14:14:40Z

Currently GetOpCode() will return OP_UNKNOWN for unknown opcodes which seems reasonable. Dropping the OP_ would appear in the asm as UNKNOWN.

As for undecodable bytes, if we did want a keyword, perhaps something more explicit like UNDECODABLE<...> would be better? We are essentially using what I would expect RAW to denote for > 5 byte minimal pushes if we go with this:

Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with ... the hex-encoding (in wire byte order) of the push.

sipa · 2023-11-15T14:18:32Z

Currently GetOpCode() will return OP_UNKNOWN for unknown opcodes which seems reasonable.

That's trivially ambiguous, as all unknown opcodes are mapped to the same thing.

As for undecodable bytes, if we did want a keyword, perhaps something more explicit like UNDECODABLE<...> would be better?

Maybe, but I wouldn't use <...> because it's not a push.

We are essentially using what I would expect RAW to denote for > 5 byte minimal pushes if we go with this:

No, any pushes become either decimal numbers, or <...>, or PUSHDATA[124]<...>. ~~Pushes are not undecodable.~~

EDIT: actually, I think all undecodable things (ignoring unknown opcodes) are pushes (but ones which straddle the end of the script boundary).

willcl-ark · 2023-11-15T22:33:26Z

Marked as draft for now and pushed what I have worked on so far.

I haven't looked at implementing undecodable bytes or nested scripts yet.

sipa · 2023-11-15T23:53:04Z

src/core_write.cpp

+            if (CheckMinimalPush(vch, opcode)) {
+                if (vch.size() <= 5 ) {
+                    // Return decimal for minimially-encoded values <= 5 bytes (OP_CLTV / OP_CSV accept 5-byte numbers)
+                    CScriptNum n(vch, false, 5);


I think you want fRequireMinimal=true here, otherwise it'll turn non-minimally-encoded values into decimal too.

Agree that it should be true as a belt-and-suspenders, but we are inside if (CheckMinimalPush(vch, opcode)) already here, so shouldn't happen?

Byte code 024000 is a minimal push of the two-byte vector 4000 (so passes CheckMinimalPush) but it's not a minimal CScriptNum since it just evaluates to 64, and should have been 0140 (so fails fRequireMinimal due to the trailing 0).

Yeah, there are two distinct concepts of minimality:

Minimal pushes apply to the interpretation of stack elements as byte vectors, and are about whether you used the right opcode (OP_n, direct push, OP_PUSHDATA1, OP_PUSHDATA2, OP_PUSHDATA4): you have to use the first applicable one from that list for a push to be a minimal push. It's relevant even for non-numeric data in script (e.g. you shouldn't use OP_PUSHDATA1 to push a public key, as a direct push suffices).

Minimally-encoded integers apply to the interpretation of stack elements as numbers, and are about whether no surplus bytes were introduced when encoding the number as a byte vector but not about how that byte vector gets pushed. The byte vectors 42, 4200, 420000 all encode the number 66, and 83, 0380, 030080, ... all encode the number -3, but you have to use the first one from each list for it to be minimally-encoded.
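The two notions of minimality can be written as two independent predicates. The Python below is a sketch that follows the description above and my reading of CheckMinimalPush and the CScriptNum minimality check; the function names are hypothetical and this is not the Bitcoin Core code.

```python
OP_0, OP_PUSHDATA1, OP_PUSHDATA2 = 0x00, 0x4c, 0x4d

def is_minimal_push(data: bytes, opcode: int) -> bool:
    """Was the first applicable push opcode used for this byte vector?"""
    if len(data) == 0:
        return opcode == OP_0
    if len(data) == 1 and (1 <= data[0] <= 16 or data[0] == 0x81):
        return False  # OP_1..OP_16 or OP_1NEGATE should have been used
    if len(data) <= 75:
        return opcode == len(data)  # direct push: the opcode is the length
    if len(data) <= 255:
        return opcode == OP_PUSHDATA1
    if len(data) <= 65535:
        return opcode == OP_PUSHDATA2
    return True

def is_minimal_number(data: bytes) -> bool:
    """Does this byte vector encode its number without surplus bytes?"""
    if not data:
        return True
    if data[-1] & 0x7f == 0:
        # The last byte carries no magnitude; it is only justified when the
        # previous byte already uses its top bit, leaving no room for the sign.
        return len(data) > 1 and bool(data[-2] & 0x80)
    return True

# Byte code 024000: a minimal push of a non-minimally-encoded number.
print(is_minimal_push(bytes.fromhex("4000"), 0x02))  # True
print(is_minimal_number(bytes.fromhex("4000")))      # False
```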

Thanks for the explanations.

I've refactored out handling pushed data, as I wasn't enjoying all the nested if statements, and notice that we are currently handling "non-minimal pushes" the same as minimal pushes with non-minimally-encoded values, which is to use OPCODE<hex>. This seems ok, and it will not be a lossy decode, but do we want to differentiate between the two pathways to this decode format?

See first and last case in snippet to illustrate what I mean above:

/*
 * Format pushed bytes into appropriate ASM format
 */
std::string FormatPushDataAsm(const std::vector<unsigned char>& vch, opcodetype opcode)
{
    // Use OPCODE<hex> for non-minimal pushes
    // TODO: There are no tests for this currently
    if (!CheckMinimalPush(vch, opcode)) {
        return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
    }
    // Use <hex> for minimal pushes > 5 bytes
    if (!(vch.size() <= 5) ) {
        return BracketStr(HexStr(vch));
    }
    // Use decimal for minimally-encoded, minimal pushes <= 5 bytes
    // Note: OP_CLTV / OP_CSV accept 5-byte numbers
    try {
        CScriptNum n{vch, true, 5};
        return strprintf("%lld", n.GetInt64());
    }
    // Use OPCODE<hex> for non-minimally-encoded minimal pushes
    // TODO: There are no tests for this currently
    catch (scriptnum_error& e) {
        return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
    }
}

Sure, but what I mean is that in my current impl there are two ways we can end up with OPCODE<...>:

- Wasn't a minimal push
- Was a minimal push <= 5 bytes, but wasn't a minimally-encoded decimal value

Hmm, right. I think that's redundant.

I'd go for something like:

If minimal push, not over 5 bytes, and minimally encoded, use decimal.

Otherwise, if minimal push, or direct push, use <...>.

Otherwise, use OPCODE<...>.
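The three rules above could look like this in Python. It is a sketch under the same assumptions as the thread (OP_1..OP_16 and OP_1NEGATE are handled elsewhere); `format_push` and its helpers are hypothetical names, not the PR's C++ functions.

```python
from typing import Optional

PUSHDATA_NAMES = {0x4c: "PUSHDATA1", 0x4d: "PUSHDATA2", 0x4e: "PUSHDATA4"}

def is_minimal_push(data: bytes, opcode: int) -> bool:
    if len(data) == 0:
        return opcode == 0x00
    if len(data) == 1 and (1 <= data[0] <= 16 or data[0] == 0x81):
        return False
    if len(data) <= 75:
        return opcode == len(data)
    if len(data) <= 255:
        return opcode == 0x4c
    if len(data) <= 65535:
        return opcode == 0x4d
    return True

def minimal_number(data: bytes) -> Optional[int]:
    """Return the encoded number, or None if it is not minimally encoded."""
    if not data:
        return 0
    if data[-1] & 0x7f == 0 and (len(data) == 1 or not data[-2] & 0x80):
        return None
    magnitude = int.from_bytes(data, "little")
    sign_mask = 0x80 << (8 * (len(data) - 1))
    return -(magnitude & ~sign_mask) if magnitude & sign_mask else magnitude

def format_push(data: bytes, opcode: int) -> str:
    minimal = is_minimal_push(data, opcode)
    # Rule 1: minimal push, at most 5 bytes, minimally-encoded number -> decimal.
    if minimal and len(data) <= 5:
        number = minimal_number(data)
        if number is not None:
            return str(number)
    # Rule 2: any other minimal push, or any direct push -> <hex>.
    if minimal or 1 <= opcode <= 75:
        return f"<{data.hex()}>"
    # Rule 3: what remains is a PUSHDATA opcode used where a shorter form existed.
    return f"{PUSHDATA_NAMES[opcode]}<{data.hex()}>"

print(format_push(bytes.fromhex("20a107"), 0x03))  # 500000
print(format_push(bytes.fromhex("4000"), 0x02))    # <4000>
print(format_push(bytes.fromhex("0100"), 0x4c))    # PUSHDATA1<0100>
```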

This is unambiguous, with the following decoding rules:

Decimal numbers are turned into OP_n if applicable, otherwise into a direct push of the minimally-encoded form of that number.

<...> is turned into a direct push if up to 75 bytes, into OP_PUSHDATA1 if below 256 bytes, into OP_PUSHDATA2 if below 65536 bytes, and into OP_PUSHDATA4 otherwise.

OPCODE<...> is turned into a push using the relevant opcode.

In particular (I think @ajtowns pointed this out before as well), using <...> for non-minimal but direct pushes is fine, as it cannot represent anything else (not an OP_n, because those use decimal, and not a PUSHDATA opcode because if those were used for sizes <= 75 bytes, they'd have used OPCODE<...>).
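The decoding rules can be sketched the same way, turning one ASM token for a push back into script bytes. This Python is illustrative only; `assemble_token` and the helpers are hypothetical, and opcode names other than pushes are out of scope.

```python
from typing import Optional

PUSHDATA_OPCODES = {"PUSHDATA1": 0x4c, "PUSHDATA2": 0x4d, "PUSHDATA4": 0x4e}
LENGTH_WIDTH = {0x4c: 1, 0x4d: 2, 0x4e: 4}

def encode_scriptnum(n: int) -> bytes:
    """Minimal little-endian sign-and-magnitude encoding of n."""
    if n == 0:
        return b""
    magnitude = abs(n)
    out = bytearray(magnitude.to_bytes((magnitude.bit_length() + 7) // 8, "little"))
    if out[-1] & 0x80:
        out.append(0x80 if n < 0 else 0x00)  # extra byte to hold the sign
    elif n < 0:
        out[-1] |= 0x80
    return bytes(out)

def push(data: bytes, opcode: Optional[int] = None) -> bytes:
    """Serialize a push, choosing the shortest form unless an opcode is forced."""
    if opcode is None:
        if len(data) <= 75:
            return bytes([len(data)]) + data
        opcode = 0x4c if len(data) < 256 else 0x4d if len(data) < 65536 else 0x4e
    return bytes([opcode]) + len(data).to_bytes(LENGTH_WIDTH[opcode], "little") + data

def assemble_token(token: str) -> bytes:
    if token.startswith("PUSHDATA"):
        name, _, rest = token.partition("<")
        return push(bytes.fromhex(rest[:-1]), PUSHDATA_OPCODES[name])
    if token.startswith("<"):
        return push(bytes.fromhex(token[1:-1]))
    n = int(token)
    if n == 0:
        return bytes([0x00])  # OP_0
    if n == -1:
        return bytes([0x4f])  # OP_1NEGATE
    if 1 <= n <= 16:
        return bytes([0x50 + n])  # OP_1..OP_16
    return push(encode_scriptnum(n))

print(assemble_token("500000").hex())           # 0320a107
print(assemble_token("<01>").hex())             # 0101
print(assemble_token("PUSHDATA1<0100>").hex())  # 4c020100
```

A round-trip fuzz test, as suggested earlier in the thread, would assert that assembling the formatted output reproduces the original bytes.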

Yeah. This just means changing the final catch to return BracketStr(HexStr(vch));, no?

Also,

if (!(vch.size() <= 5) ) {

Can I introduce you to > ? 😄

sipa · 2023-11-16T00:07:15Z

src/core_write.cpp

                } else {
-                    str += HexStr(vch);
+                    // Otherwise display the push as LE hex enclosed in angle brackets


I expect this to be a controversial opinion, but I disagree with calling this "little endian". That's a term that applies to encoding numbers to bytes/bits. Since what's being encoded isn't a number, or to be interpreted as a number, endianness is inapplicable. It's just encoding bytes in hexadecimal format, in order.

It's not little endian, I agree.

sipa · 2023-11-17T13:35:46Z

Brainstormy idea for undecodable scripts.

These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using <...+N> or OPCODE<...+N>, where ... is hexadecimal as before, and N is a decimal number indicating how many bytes are missing?

EDIT: for OP_PUSHDATA[124] it's also possible that the 1-4 byte length field itself straddles the script end, and that isn't covered by this suggestion.
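Both failure cases (data running past the end, and a truncated length field) show up naturally in a small parser. The Python below is a sketch, not Bitcoin Core's GetScriptOp; `split_ops` is a hypothetical name, and it returns whatever could not be decoded as trailing bytes.

```python
from typing import List, Optional, Tuple

PUSHDATA_WIDTH = {0x4c: 1, 0x4d: 2, 0x4e: 4}

def split_ops(script: bytes) -> Tuple[List[Tuple[int, Optional[bytes]]], bytes]:
    """Return (decoded ops, undecodable trailing bytes)."""
    ops: List[Tuple[int, Optional[bytes]]] = []
    pos = 0
    while pos < len(script):
        start = pos
        opcode = script[pos]
        pos += 1
        data = None
        if opcode <= 0x4e:
            if opcode in PUSHDATA_WIDTH:
                width = PUSHDATA_WIDTH[opcode]
                if pos + width > len(script):
                    return ops, script[start:]  # the length field itself is cut off
                size = int.from_bytes(script[pos:pos + width], "little")
                pos += width
            else:
                size = opcode  # OP_0 pushes nothing; 0x01..0x4b push that many bytes
            if pos + size > len(script):
                return ops, script[start:]  # the pushed data runs past the end
            data = script[pos:pos + size]
            pos += size
        ops.append((opcode, data))
    return ops, b""

# A direct push announcing 5 bytes with only 4 present.
print(split_ops(bytes.fromhex("0511121314")))
# OP_1 followed by OP_PUSHDATA2 with half of its length field missing.
print(split_ops(bytes.fromhex("514d01")))
```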

willcl-ark · 2023-11-17T14:06:10Z

Brainstormy idea for undecodable scripts.

Currently, GetScriptOp only returns false when it runs out of bytes while reading either the length or the operand, and in that case we don't store the opcode or the remaining bytes from the iterator in opcodeRet or pvchRet. But we could change GetScriptOp to store them in the failure case too.

<...+N> or OPCODE<...+N>

I don't think it makes much difference which flavour is chosen here; both would require the N to be parsed to calculate the total length and therefore determine the opcode that was used before it was encoded. As there are bytes missing, I don't think we can tell if this was a minimal push or not either. I'd probably slightly lean towards using OPCODE<...+N> as it's slightly more human-friendly?

sipa · 2023-11-17T14:08:44Z

I'd probably slightly lean towards using OPCODE<...+N> as it's slightly more human-friendly?

In case it's a direct push there is no opcode at all. I didn't mean this as a flavor to choose from; we need both.

ajtowns · 2023-11-22T01:03:05Z

Brainstormy idea for undecodable scripts.

These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using <...+N> or OPCODE<...+N>, where ... is hexadecimal as before, and N is a decimal number indicating how many bytes are missing?

EDIT: for OP_PUSHDATA[124] it's also possible that the 1-4 byte length field itself straddles the script end, and that isn't covered by this suggestion.

If you're adding extra encoding that needs to be parsed (+N), maybe just have UNPARSABLE<hex> or GARBAGE<hex> to capture the trailing data that wasn't a full push? Seems less likely to be confusing (easy to miss a +30 at the end of a hex string), and works even if we change an OP_SUCCESS into a multibyte opcode in future?

sipa · 2023-11-22T01:08:42Z

Yeah, my only (rather minor) reservation about UNPARSEABLE<...> is that I think of <...> as a way of indicating "stuff being pushed", which wouldn't be the case here. And if we're going to introduce some special syntax for unparseable stuff anyway, I thought it might be useful to integrate into the push syntax itself (as everything unparseable is an incomplete push).

That said, I agree that the +N thing isn't super readable, and it's incomplete anyway (can't deal with OP_PUSHDATA[124] whose length descriptor itself is missing).

So maybe just UNPARSEABLE(...)?

ajtowns · 2023-11-22T03:58:21Z

Yeah, my only (rather minor) reservation about UNPARSEABLE<...> is that I think of <...> as a way of indicating "stuff being pushed",

Yeah, I guess being able to scan and say <...> means data is going onto the stack makes sense.

So maybe just UNPARSEABLE(...)?

Could do UNPARSEABLE:04010203 or something (colon instead of brackets).

(Arguably, everything after the first OP_SUCCESS in a tapscript could also be considered trailing hex that shouldn't be decoded. Probably more useful to decode it anyway, though)

maflcko · 2024-03-11T07:51:44Z

Are you still working on this?

sipa · 2024-05-08T14:17:14Z

Attempt to revive the discussion: #27795 (comment)

ajtowns reviewed Nov 9, 2023

willcl-ark marked this pull request as draft November 15, 2023 20:02

willcl-ark force-pushed the asm-full-hex branch 2 times, most recently from 016803c to 89547bb Compare November 15, 2023 22:31

sipa reviewed Nov 15, 2023

sipa reviewed Nov 16, 2023

DrahtBot added the CI failed label Jan 16, 2024

josibake mentioned this pull request Jan 27, 2024

Script Visualization mempool/mempool#4620

Open

Luisjdj approved these changes Mar 11, 2024

script: don't decode sighash flags in ASM repr

2bf4085

We don't decode these on taproot signatures already, modify non-taproot signatures to match.

willcl-ark force-pushed the asm-full-hex branch from 89547bb to 0a7c9d9 Compare March 12, 2024 23:23

script: add GetOpNameAsm helper fn

1c3468e

Strip the unnecessary leading "OP_" prefix from opcodes when they are rendered as ASM.

willcl-ark force-pushed the asm-full-hex branch from 0a7c9d9 to 71e6cf5 Compare March 12, 2024 23:39

willcl-ark changed the title ~~Use LE hex-encoded representations in script ASM for pushed values <= 4 bytes~~ Fix ASM ambiguity Mar 12, 2024

DrahtBot removed the CI failed label Mar 13, 2024

willcl-ark added 5 commits March 13, 2024 10:52

script: drop OP_ prefix from ASM repr

88b6115

test: add asm repr asm_hex helper fn

7695272

A simple helper function to angle bracket a hex string.
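A helper of this kind could be as small as the following sketch. Only the name `asm_hex` comes from the commit title; the signature and body are assumptions, not the code in the commit.

```python
def asm_hex(hex_str: str) -> str:
    """Wrap a hex string in angle brackets, the ASM notation for pushed data."""
    return f"<{hex_str}>"


print(asm_hex("4200"))
```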

script: remove ambiguity from ASM repr

614d25c

* Return minimal pushes < 5 bytes as decimal
* Return non-minimal pushes < 5 bytes wrapped as OP_CODE<...>
* Return pushes > 5 bytes as hex wrapped as <...>
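The rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ in this commit: it follows the thresholds in the pull request description (decimal only for minimally-encoded numbers of at most 5 bytes), whereas the commit message writes "< 5" and "> 5"; it ignores the small-integer opcodes (OP_1 to OP_16, OP_1NEGATE) when deciding whether a push opcode is minimal; and every function name is hypothetical. `opcode` must be the push opcode that was actually used for `data`.

```python
PUSHDATA_NAMES = {0x4c: "PUSHDATA1", 0x4d: "PUSHDATA2", 0x4e: "PUSHDATA4"}


def minimal_push_opcode(data: bytes) -> int:
    """Smallest push opcode able to carry `data` (small-int opcodes ignored)."""
    n = len(data)
    if n <= 0x4b:
        return n  # direct push: the opcode is the length
    if n <= 0xff:
        return 0x4c
    if n <= 0xffff:
        return 0x4d
    return 0x4e


def is_minimal_scriptnum(data: bytes) -> bool:
    """True if `data` is the shortest encoding of its script number."""
    if not data:
        return True  # the empty vector encodes zero
    if data[-1] & 0x7f:
        return True  # the top byte carries magnitude bits
    # A top byte of 0x00 or 0x80 is only needed when the byte below it
    # already has its high bit set.
    return len(data) > 1 and bool(data[-2] & 0x80)


def decode_scriptnum(data: bytes) -> int:
    """Decode a little-endian, sign-magnitude script number."""
    if not data:
        return 0
    value = int.from_bytes(data, "little")
    sign_bit = 0x80 << (8 * (len(data) - 1))
    return -(value ^ sign_bit) if value & sign_bit else value


def render_push(opcode: int, data: bytes) -> str:
    """Render one push as an ASM token."""
    if opcode != minimal_push_opcode(data):
        # Non-minimal push: keep the opcode name so the exact bytes are recoverable.
        return f"{PUSHDATA_NAMES[opcode]}<{data.hex()}>"
    if len(data) <= 5 and is_minimal_scriptnum(data):
        return str(decode_scriptnum(data))
    return f"<{data.hex()}>"


print(render_push(0x03, bytes.fromhex("20a107")))  # CLTV value from the functional test
print(render_push(0x4c, bytes.fromhex("0100")))    # non-minimal push
print(render_push(0x02, bytes.fromhex("4200")))    # non-minimally-encoded number
```

Because a decimal token is produced only for a minimal push of a minimally-encoded number, each token maps back to exactly one byte sequence under these assumptions.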

script: return undecodable bytes

b3a8c40

doc: release note for asm repr script changes

4890626

willcl-ark force-pushed the asm-full-hex branch from 71e6cf5 to 4890626 Compare March 13, 2024 10:53

sipa mentioned this pull request May 8, 2024

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

Open
