Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

Open
MatthewLM opened this issue May 31, 2023 · 11 comments · May be fixed by #28824
Open

Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

MatthewLM opened this issue May 31, 2023 · 11 comments · May be fixed by #28824

Comments

@MatthewLM
Copy link

Please describe the feature you'd like to see added.

Distinguish between decimal and hex integers in the script ASM to avoid ambiguity. Hex integers/data can be prefixed with 0x to avoid ambiguity.

Is your feature related to a problem, if so please describe it.

When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal. decodescript 0511121314150457c74942 produces ASM of 1112131415 1112131415 , despite the integers being different. There is no distinction between hex and decimal.

Describe the solution you'd like

Prefixing 0x makes the most sense to me.

Describe any alternatives you've considered

All data could be provided as hex, or "d" could be added to decimals. Removing ambiguity is the primary concern here.

Please leave any additional context

No response

@fanquake
Copy link
Member

Related to #7996.

@darosior
Copy link
Member

darosior commented Jun 1, 2023

When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal.

It's not two different integers, it's that integers (which all <=4 bytes pushes are assumed to be) are displayed in decimal and all other pushes are displayed as hexadecimal.

I wonder why we even try to detect and print numbers differently (and even more so in a different base!), just print all pushes in hex and there is no ambiguity anymore?

@MatthewLM
Copy link
Author

MatthewLM commented Jun 1, 2023

They are two different numbers. One is 0x1112131415 and the other is 0x57c74942 (encoded as little-endian). In ASM, they are displayed the same. Using 0x before hex is consistent with hex literals elsewhere and would allow for decimals to be displayed including -1 for OP_1NEGATE.

Displaying all numbers as hex is an option. One problem is that they are given as little-endian which could be confusing for some wishing to interpret them. Off the top of my memory, the scripting interpreter treats 32-bit numbers arithmetically so I think it makes sense to display those separately. The only difference should be that hex and decimal should be distinguished.

@darosior
Copy link
Member

darosior commented Jun 1, 2023

Sure, i misspoke. I just meant:

>>> int.from_bytes(bytes.fromhex("57c74942"), "little")
1112131415

@willcl-ark
Copy link
Member

willcl-ark commented Jun 2, 2023

It's a little unclear to me how the ASM output is used on the consumer side, and therefore how breaking any changes to it might be. However I still think we have a few options to resolve this:

  1. Fully hex-encoded output in the ASM repr:
$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
{
  "asm": "1112131415 57c74942",
  1. Prefixing hex values with 0x (we would end up with a lot of prefixes in this case):
$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
{
  "asm": "0x1112131415 1112131415",
  1. Prefixing decimal values with 0d (only prefixing small pushes):
$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
{
  "asm": "1112131415 0d1112131415",

As I think Antoine inticated, I'd have a preference for "full hex". However I think there may be a small quality of life advantage to prefixing decimal values for persons manually crafting or reading scripts using ASM notation, as e.g. timelock values are most easily thought about in decimal.

Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

willcl-ark added a commit to willcl-ark/bitcoin that referenced this issue Jun 5, 2023
Closes: bitcoin#27795

Previously ScriptToAsmStr returned hex-encoded integers, except if data
length was <= 4 bytes, in which case it displayed using decimal
encoding.

This was originally added by Satoshi in
4405b78/script.h#L298-L305

Remove the decimal carve-out for small pushes.
@MatthewLM
Copy link
Author

MatthewLM commented Jun 5, 2023

Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

It is because any numerical/arithmetic operations only allow 32-bit values (please see the CScriptNum class), so the ASM displays 32-bit values as decimal and longer values as hex. This makes sense as humans generally better comprehend decimal values when thinking arithmetically, but the values need to be distinguishable.

@willcl-ark
Copy link
Member

Well, they would still be 32 bit values, just hex-encoded? But I take the point. I personally think prefixing hex with 0x would be kind of horrible, as this would cascade into including pubkeys for example.

Is prefixing decimals with 0d the best approach then?

I am still unclear on how much impact this would have downstream, and therefore whether any of these discussed changes would be accepted in any form. It seems like this could break much of the little scripting tooling we have available for bitcoin...

@MatthewLM
Copy link
Author

Is it a problem prefixing public keys with 0x? I'm not overly fussed on the solution as long as the ambiguity is removed.

It seems like this could break much of the little scripting tooling we have available for bitcoin

If these tools read these ambiguous numbers/data from ASM, then they are already broken, so this needs fixing one way or another.

@ajtowns
Copy link
Contributor

ajtowns commented Aug 25, 2023

Maybe ajtowns@36cbf11 ? It adds a "0x" if the hex is ambiguous (ie none of the digits are a-f). Presumably almost all pubkeys will not be ambiguous (1-in-12-trillion chance) so the 0x won't get added there and confuse people.

$ ./bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
"0x1112131415 1112131415"
$ ./bitcoin-cli -regtest decodescript 0511121c14150457c74942 | jq .asm
"11121c1415 1112131415"

Another approach could be to add the prefix for hex strings of 5 bytes (10 hex digits) -- anything that would decode to 8 or fewer hex digits would be decoded as decimal anyway, and anything with >11 decimal digits would be more than 4 bytes originally so would not have been decoded as decimal.

willcl-ark added a commit to willcl-ark/bitcoin that referenced this issue Nov 8, 2023
Closes: bitcoin#27795
Closes: bitcoin#7996

Previously ScriptToAsmStr returned hex-encoded integers, except if data
length was <= 4 bytes, in which case it displayed using decimal
encoding.

This was originally added by Satoshi in
4405b78/script.h#L298-L305

Remove the decimal carve-out for small pushes.
willcl-ark added a commit to willcl-ark/bitcoin that referenced this issue Nov 8, 2023
Closes: bitcoin#27795
Closes: bitcoin#7996

Previously ScriptToAsmStr returned hex-encoded integers, except if data
length was <= 4 bytes, in which case it displayed using decimal
encoding.

This was originally added by Satoshi in
4405b78/script.h#L298-L305

Remove the decimal carve-out for small pushes.
@willcl-ark willcl-ark linked a pull request Nov 8, 2023 that will close this issue
@luke-jr
Copy link
Member

luke-jr commented Nov 8, 2023

Even aside from the main issue, it's still ambiguous for the various opcode possibilities... :/

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this issue Nov 9, 2023
Closes: bitcoin#27795
Closes: bitcoin#7996

Previously ScriptToAsmStr returned hex-encoded integers, except if data
length was <= 4 bytes, in which case it displayed using decimal
encoding.

This was originally added by Satoshi in
4405b78/script.h#L298-L305

Remove the decimal carve-out for small pushes.

Github-Pull: bitcoin#28824
Rebased-From: d6c0c4b
@sipa
Copy link
Member

sipa commented May 8, 2024

The discussion here led to #28824 being opened, where more discussion happened, and the idea evolved further. Having re-read things there, I have a concrete proposal that perhaps goes a bit further, so I'm posting it here to keep PR discussion about the implementation.

The goal is addressing ambiguity and readability, but not necessarily compatibility (because being compatibility with something broken is silly).

Decoding rules

Even if all we expose on the RPC/user side is encoding, I believe it's useful to formally specify the decoding rules, as they help understand how the encoding can be unambiguous. If this is implemented, I think we want the decoder written too, so that e.g. round-trip fuzz tests can be added for the encoding.

  • If the asm string in its entirety is an even length hex string, it is decoded raw (i.e. script.size() == asm.size() / 2 exactly).
  • Otherwise, the script is interpreted as consisting of 1 or more whitespace-separated units, which are each decoded as listed below and concatenated:
    • An opcode name, excluding OP_PUSHDATAn, with or without OP_ prefix, is decoded to the single opcode byte.
    • A decimal number, optionally prefixed with + or -, is decoded as the script consisting of the minimum push of the minimal encoding of that number (i.e. OP_1NEGATE, OP_0 through OP_16, or direct push), assuming that encoding is no more than 5 bytes. For 0 through 16 this overlaps with the previous rule, but since they mean the same thing this is ok.
    • A string surrounded by < and > means a direct push or OP_PUSHDATAx push (whichever is smaller) of the recursively-decoded string between angle brackets. So that string can be just hex data, or it can be another sequence of units.
    • A string surrounded by PUSHDATAx< and > means a push using the specified OP_PUSHDATAx push of the recursively-decoded string between angle brackets.
    • An even length hex string prefixed with "#" means raw hex decoding that string.

Notably, this enables recursive encoding (e.g. a 2-of-3 multisig P2SH scriptSig could become 0 <sig1> <sig3> <2 <pub1> <pub2> <pub3> CHECKMULTISIG>) .

The ambiguity of asm scripts that consists entirely of an even number of decimal characters is resolved in favor of treating them as raw hex rather than as a decimal integer, because:

  • Scripts that entirely consist of a single push are uninteresting and presumably rare (they're anyone-can-spend).
  • This means that the rules for decoding what's inside <> are exactly the same as decoding scripts.

Compared to the earlier discussion, it uses # instead of UNPARSABLE:, GARBAGE:, RAW() or any of those. Happy to bikeshed this; it shouldn't occur frequently.

It drops 0x things entirely, because I think the ones we have in the asm format currently are nonsensical (sometimes they are raw bytes, sometimes they are pushes, but there they use raw byte order while 0x to me would generally mean human-ordered (big endian) numbers (as they do in C and Python). If people feel compatibility with the 0x stuff is desirable, it could be added back, but I'd rather avoid it. Alternatively, 0x for numeric things would make more sense to me (e.g. 0x12345 meaning the script whose hex bytes are "03 45 23 01"), but that's an even bigger break with what we had before.

Encoding rules

The format allows multiple possible encodings for the same thing (e.g. script byte 00 aka OP_0 can be written as <>, 0, OP_0, or #00), so the encoding is not unique. This is a suggested approach:

  • For pushes, try in order:
    • If just a decimal integer encodes the right thing, use that. This is always the case for (among others) OP_1NEGATE, and OP_0 through OP_16. A + or OP_ in front is only needed if the entire script consists of just a single push whose decimal encoding would have even length otherwise (because confusion with entirely hex-encoded string rule).
    • If the push is minimal otherwise, use < + hex encoding of the pushed data + >. If the entire script is push-only, and the last push looks like a script itself, use < + disassembly of the pushed script + > instead.
    • Use PUSHDATAx< + hex encoding of pushed data + > otherwise.
  • Other opcodes are just turned into their name.
  • Unparsable data becomes # + raw hex encoding.

If everything, or a large fraction of the script, is unparsable just hex encode instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants