Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

MatthewLM · 2023-05-31T11:14:55Z

Please describe the feature you'd like to see added.

Distinguish between decimal and hex integers in the script ASM to avoid ambiguity. Hex integers/data can be prefixed with 0x to avoid ambiguity.

Is your feature related to a problem, if so please describe it.

When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal. decodescript 0511121314150457c74942 produces ASM of 1112131415 1112131415 , despite the integers being different. There is no distinction between hex and decimal.

Describe the solution you'd like

Prefixing 0x makes the most sense to me.

Describe any alternatives you've considered

All data could be provided as hex, or "d" could be added to decimals. Removing ambiguity is the primary concern here.

Please leave any additional context

No response

The text was updated successfully, but these errors were encountered:

fanquake · 2023-05-31T12:49:36Z

Related to #7996.

darosior · 2023-06-01T14:04:40Z

When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal.

~~It's not two different integers,~~ it's that integers (which all <=4 bytes pushes are assumed to be) are displayed in decimal and all other pushes are displayed as hexadecimal.

I wonder why we even try to detect and print numbers differently (and even more so in a different base!), just print all pushes in hex and there is no ambiguity anymore?

MatthewLM · 2023-06-01T14:21:17Z

They are two different numbers. One is 0x1112131415 and the other is 0x57c74942 (encoded as little-endian). In ASM, they are displayed the same. Using 0x before hex is consistent with hex literals elsewhere and would allow for decimals to be displayed including -1 for OP_1NEGATE.

Displaying all numbers as hex is an option. One problem is that they are given as little-endian which could be confusing for some wishing to interpret them. Off the top of my memory, the scripting interpreter treats 32-bit numbers arithmetically so I think it makes sense to display those separately. The only difference should be that hex and decimal should be distinguished.

darosior · 2023-06-01T14:28:18Z

Sure, i misspoke. I just meant:

>>> int.from_bytes(bytes.fromhex("57c74942"), "little")
1112131415

willcl-ark · 2023-06-02T08:59:41Z

It's a little unclear to me how the ASM output is used on the consumer side, and therefore how breaking any changes to it might be. However I still think we have a few options to resolve this:

Fully hex-encoded output in the ASM repr:

$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
{
  "asm": "1112131415 57c74942",

Prefixing hex values with 0x (we would end up with a lot of prefixes in this case):

$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
{
  "asm": "0x1112131415 1112131415",

Prefixing decimal values with 0d (only prefixing small pushes):

$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
{
  "asm": "1112131415 0d1112131415",

As I think Antoine inticated, I'd have a preference for "full hex". However I think there may be a small quality of life advantage to prefixing decimal values for persons manually crafting or reading scripts using ASM notation, as e.g. timelock values are most easily thought about in decimal.

Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

Closes: bitcoin#27795 Previously ScriptToAsmStr returned hex-encoded integers, except if data length was <= 4 bytes, in which case it displayed using decimal encoding. This was originally added by Satoshi in 4405b78/script.h#L298-L305 Remove the decimal carve-out for small pushes.

MatthewLM · 2023-06-05T12:37:38Z

Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

It is because any numerical/arithmetic operations only allow 32-bit values (please see the CScriptNum class), so the ASM displays 32-bit values as decimal and longer values as hex. This makes sense as humans generally better comprehend decimal values when thinking arithmetically, but the values need to be distinguishable.

willcl-ark · 2023-06-05T14:11:48Z

Well, they would still be 32 bit values, just hex-encoded? But I take the point. I personally think prefixing hex with 0x would be kind of horrible, as this would cascade into including pubkeys for example.

Is prefixing decimals with 0d the best approach then?

I am still unclear on how much impact this would have downstream, and therefore whether any of these discussed changes would be accepted in any form. It seems like this could break much of the little scripting tooling we have available for bitcoin...

MatthewLM · 2023-06-05T15:20:15Z

Is it a problem prefixing public keys with 0x? I'm not overly fussed on the solution as long as the ambiguity is removed.

It seems like this could break much of the little scripting tooling we have available for bitcoin

If these tools read these ambiguous numbers/data from ASM, then they are already broken, so this needs fixing one way or another.

ajtowns · 2023-08-25T07:03:05Z

Maybe ajtowns@36cbf11 ? It adds a "0x" if the hex is ambiguous (ie none of the digits are a-f). Presumably almost all pubkeys will not be ambiguous (1-in-12-trillion chance) so the 0x won't get added there and confuse people.

$ ./bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"0x1112131415 1112131415"
$ ./bitcoin-cli -regtest decodescript 0511121c14150457c74942 | jq .asm
"11121c1415 1112131415"

Another approach could be to add the prefix for hex strings of 5 bytes (10 hex digits) -- anything that would decode to 8 or fewer hex digits would be decoded as decimal anyway, and anything with >11 decimal digits would be more than 4 bytes originally so would not have been decoded as decimal.

Closes: bitcoin#27795 Closes: bitcoin#7996 Previously ScriptToAsmStr returned hex-encoded integers, except if data length was <= 4 bytes, in which case it displayed using decimal encoding. This was originally added by Satoshi in 4405b78/script.h#L298-L305 Remove the decimal carve-out for small pushes.

luke-jr · 2023-11-08T21:56:37Z

Even aside from the main issue, it's still ambiguous for the various opcode possibilities... :/

Closes: bitcoin#27795 Closes: bitcoin#7996 Previously ScriptToAsmStr returned hex-encoded integers, except if data length was <= 4 bytes, in which case it displayed using decimal encoding. This was originally added by Satoshi in 4405b78/script.h#L298-L305 Remove the decimal carve-out for small pushes. Github-Pull: bitcoin#28824 Rebased-From: d6c0c4b

sipa · 2024-05-08T14:16:23Z

The discussion here led to #28824 being opened, where more discussion happened, and the idea evolved further. Having re-read things there, I have a concrete proposal that perhaps goes a bit further, so I'm posting it here to keep PR discussion about the implementation.

The goal is addressing ambiguity and readability, but not necessarily compatibility (because being compatibility with something broken is silly).

Decoding rules

Even if all we expose on the RPC/user side is encoding, I believe it's useful to formally specify the decoding rules, as they help understand how the encoding can be unambiguous. If this is implemented, I think we want the decoder written too, so that e.g. round-trip fuzz tests can be added for the encoding.

If the asm string in its entirety is an even length hex string, it is decoded raw (i.e. script.size() == asm.size() / 2 exactly).
Otherwise, the script is interpreted as consisting of 1 or more whitespace-separated units, which are each decoded as listed below and concatenated:
- An opcode name, excluding OP_PUSHDATAn, with or without OP_ prefix, is decoded to the single opcode byte.
- A decimal number, optionally prefixed with + or -, is decoded as the script consisting of the minimum push of the minimal encoding of that number (i.e. OP_1NEGATE, OP_0 through OP_16, or direct push), assuming that encoding is no more than 5 bytes. For 0 through 16 this overlaps with the previous rule, but since they mean the same thing this is ok.
- A string surrounded by < and > means a direct push or OP_PUSHDATAx push (whichever is smaller) of the recursively-decoded string between angle brackets. So that string can be just hex data, or it can be another sequence of units.
- A string surrounded by PUSHDATAx< and > means a push using the specified OP_PUSHDATAx push of the recursively-decoded string between angle brackets.
- An even length hex string prefixed with "#" means raw hex decoding that string.

Notably, this enables recursive encoding (e.g. a 2-of-3 multisig P2SH scriptSig could become 0 <sig1> <sig3> <2 <pub1> <pub2> <pub3> CHECKMULTISIG>) .

The ambiguity of asm scripts that consists entirely of an even number of decimal characters is resolved in favor of treating them as raw hex rather than as a decimal integer, because:

Scripts that entirely consist of a single push are uninteresting and presumably rare (they're anyone-can-spend).
This means that the rules for decoding what's inside <> are exactly the same as decoding scripts.

Compared to the earlier discussion, it uses # instead of UNPARSABLE:, GARBAGE:, RAW() or any of those. Happy to bikeshed this; it shouldn't occur frequently.

It drops 0x things entirely, because I think the ones we have in the asm format currently are nonsensical (sometimes they are raw bytes, sometimes they are pushes, but there they use raw byte order while 0x to me would generally mean human-ordered (big endian) numbers (as they do in C and Python). If people feel compatibility with the 0x stuff is desirable, it could be added back, but I'd rather avoid it. Alternatively, 0x for numeric things would make more sense to me (e.g. 0x12345 meaning the script whose hex bytes are "03 45 23 01"), but that's an even bigger break with what we had before.

Encoding rules

The format allows multiple possible encodings for the same thing (e.g. script byte 00 aka OP_0 can be written as <>, 0, OP_0, or #00), so the encoding is not unique. This is a suggested approach:

For pushes, try in order:
- If just a decimal integer encodes the right thing, use that. This is always the case for (among others) OP_1NEGATE, and OP_0 through OP_16. A + or OP_ in front is only needed if the entire script consists of just a single push whose decimal encoding would have even length otherwise (because confusion with entirely hex-encoded string rule).
- If the push is minimal otherwise, use < + hex encoding of the pushed data + >. If the entire script is push-only, and the last push looks like a script itself, use < + disassembly of the pushed script + > instead.
- Use PUSHDATAx< + hex encoding of pushed data + > otherwise.
Other opcodes are just turned into their name.
Unparsable data becomes # + raw hex encoding.

If everything, or a large fraction of the script, is unparsable just hex encode instead.

MatthewLM added the Feature label May 31, 2023

maflcko added the RPC/REST/ZMQ label May 31, 2023

willcl-ark linked a pull request Nov 8, 2023 that will close this issue

Fix ASM ambiguity #28824

Draft

sipa mentioned this issue May 9, 2024

decoderawtransaction should use hex or decimal, not both #30067

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

MatthewLM commented May 31, 2023

fanquake commented May 31, 2023

darosior commented Jun 1, 2023 •

edited

MatthewLM commented Jun 1, 2023 •

edited

darosior commented Jun 1, 2023

willcl-ark commented Jun 2, 2023 •

edited

MatthewLM commented Jun 5, 2023 •

edited

willcl-ark commented Jun 5, 2023

MatthewLM commented Jun 5, 2023

ajtowns commented Aug 25, 2023

luke-jr commented Nov 8, 2023

sipa commented May 8, 2024 •

edited

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

Comments

MatthewLM commented May 31, 2023

Please describe the feature you'd like to see added.

Is your feature related to a problem, if so please describe it.

Describe the solution you'd like

Describe any alternatives you've considered

Please leave any additional context

fanquake commented May 31, 2023

darosior commented Jun 1, 2023 • edited

MatthewLM commented Jun 1, 2023 • edited

darosior commented Jun 1, 2023

willcl-ark commented Jun 2, 2023 • edited

MatthewLM commented Jun 5, 2023 • edited

willcl-ark commented Jun 5, 2023

MatthewLM commented Jun 5, 2023

ajtowns commented Aug 25, 2023

luke-jr commented Nov 8, 2023

sipa commented May 8, 2024 • edited

Decoding rules

Encoding rules

darosior commented Jun 1, 2023 •

edited

MatthewLM commented Jun 1, 2023 •

edited

willcl-ark commented Jun 2, 2023 •

edited

MatthewLM commented Jun 5, 2023 •

edited

sipa commented May 8, 2024 •

edited