feat: precise tagging of scalar values #4369

Merged
merged 141 commits into from
Feb 23, 2024

Conversation

crusso
Contributor

@crusso crusso commented Jan 24, 2024

The Motoko runtime representation of values is largely untyped, distinguishing only between scalar and boxed values
using a single bit of the 32-bit value representation. The tagging is only to support garbage collection, not precise runtime type information.

In the existing value encoding, a Motoko value in vanilla form is a 32-bit value that is either:

  • false (0b0),
  • true (0b1),
  • a word-aligned (encoded) pointer to a heap allocated value,
    encoded ("skewed") by subtracting 1 from the pointer value, so that the 2 LSBs are 0b11,
  • null (some well-known skewed pointer).
  • a 31-bit scalar value, stored in the top bits of the value with LSB 0.

Scalar values encode Nat8/16 and Int8/16 values and chars, and 31-bit subranges of Nat32, Int32, Nat64, Int64, Nat and Int. Large integer values that don't fit in a 31-bit scalar are boxed on the heap.
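
As a minimal sketch of the existing (pre-PR) encoding listed above, the following shows how a 32-bit value could be classified from its low bits alone. The names `OldValue`, `skew`, `encode_scalar` and `classify_old` are hypothetical and are not taken from the Motoko RTS; note that `false` is indistinguishable from the scalar 0, which is exactly the lack of type information this PR addresses.

```rust
/// How a 32-bit value reads under the old, largely untyped encoding.
#[derive(Debug, PartialEq)]
enum OldValue {
    /// LSB 0: a 31-bit scalar held in the top bits (false is just scalar 0).
    Scalar(i32),
    /// The distinguished value 0b1.
    True,
    /// Two LSBs 0b11: a skewed pointer, reported here as the real address.
    Pointer(u32),
}

/// Skew a word-aligned address by subtracting 1 (low bits become 0b11).
fn skew(ptr: u32) -> u32 {
    debug_assert!(ptr % 4 == 0);
    ptr.wrapping_sub(1)
}

/// Store a scalar in the top 31 bits; the caller must ensure it fits.
fn encode_scalar(x: i32) -> u32 {
    (x << 1) as u32
}

/// Classify a value by its low bits; None marks a bit pattern the
/// encoding never produces (low bits 0b01 other than the value 1).
fn classify_old(v: u32) -> Option<OldValue> {
    if v & 1 == 0 {
        // Arithmetic shift recovers the signed view of the payload.
        Some(OldValue::Scalar((v as i32) >> 1))
    } else if v == 1 {
        Some(OldValue::True)
    } else if v & 0b11 == 0b11 {
        Some(OldValue::Pointer(v.wrapping_add(1)))
    } else {
        None
    }
}
```

The garbage collector only needs the scalar/pointer distinction, which is why a single bit sufficed before this change.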

Observe that, in Motoko, some types are always scalar (e.g. Nat8), some types are always boxed (e.g. Blob), and some types have a mixed scalar/boxed representation (e.g. Nat32 and Nat), depending on the size of the value.

This PR adds exact runtime type information to all[*] scalar values, making the scalar values self describing.
Making the entire heap fully self-describing requires refining the heap tags used to identify heap objects, distinguishing boxed Nat32 from boxed Int32, Blob from Principal and Text, tuples from (mutable and immutable) arrays etc. That work of refining heap tags will need to be completed in a follow-on or sibling PR, but is hopefully less involved than the changes herein.

To add precise scalar type info, we extend the scalar tagging scheme with a richer set of (inline) type descriptors, using some of the least significant bits of the 31-bit scalar representation.

To avoid dedicating a fixed-length suffix (say 1 byte) to the scalar tag, scalar tags are actually variable length, using shorter tags for larger payload types, and longer tags for shorter payload types. This gives us a reasonable tag space (set of possible tags, some still unused), without reducing the scalar range of mixed representation types too much.

At one extreme, the tag of Int (and Nat) is just 0b10, leaving a 30-bit payload for compact Nat/Int, losing just 1 bit from the current representation's 31-bit compact range. This is important because Ints are common, and Nats are used to index arrays, so we should avoid boxing more than necessary.

In the middle, the tags of Nat16 and Int16 are 0b01(0^12)00 and 0b11(0^12)00 respectively (matching the table below), leaving a 16-bit payload in the most significant bits.

At the other extreme, the tag of the unit value, (), is the 32-bit pattern 0b01(0^28)00, occupying the entire value.
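
The variable-length idea can be sketched as follows, taking the bit patterns from the Tagging Scheme table in this description. This is an illustration only: the function and constant names are hypothetical, the real implementation lives in the OCaml code generator, and the signed 30-bit range used for compact Nat/Int is an assumption based on the review discussion.

```rust
/// 2-bit tag shared by compact Nat and Int (30-bit payload).
const NUM_TAG: u32 = 0b10;
/// 16-bit tag fields for Nat16 and Int16 (16-bit payload in the top half).
const NAT16_TAG: u32 = 0b01 << 14;
const INT16_TAG: u32 = 0b11 << 14;
/// The unit value is all tag and no payload.
const UNIT: u32 = 0b01 << 30;

/// Tag a compact integer, or return None if it needs boxing.
/// Assumes the compact range is the signed 30-bit range.
fn tag_num(x: i32) -> Option<u32> {
    if x >= -(1 << 29) && x < (1 << 29) {
        Some(((x << 2) as u32) | NUM_TAG)
    } else {
        None
    }
}

/// Arithmetic shift drops the tag and restores the sign.
fn untag_num(v: u32) -> i32 {
    (v as i32) >> 2
}

fn tag_nat16(x: u16) -> u32 {
    ((x as u32) << 16) | NAT16_TAG
}

/// Logical shift for the unsigned type.
fn untag_nat16(v: u32) -> u16 {
    (v >> 16) as u16
}

fn tag_int16(x: i16) -> u32 {
    ((x as u16 as u32) << 16) | INT16_TAG
}

/// Arithmetic shift for the signed type.
fn untag_int16(v: u32) -> i16 {
    ((v as i32) >> 16) as i16
}
```

A short tag costs the wide types only two bits of range, while the narrow types can afford a long tag because their payload never needs the low bits.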

The primary motivation of this work is to support value-driven, rather than type-driven, serialization of stable values to a precisely typed stable format without loss of type information, so that upgrades can still accommodate type-dependent changes of representation from one in-memory format to another. Secondary motivations are live and post-mortem heap inspection tools and lightweight debugging tools that can parse values in locals, arguments and on the heap using tags.

[*] There remain some raw, untagged 31-bit scalars whose type is only known to the compiler. These are used to encode the state of text and blob iterators, hidden in dedicated iterator closure environments. Note that these are not stable types, so need not be precisely tagged for stabilization.

Tagging Scheme

Value Type Payload bits
((O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O)) TBool (* false *) 0
((O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,I)) TBool (* true *) 0
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,I)) TRef 30
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,O)) TNum 30
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,O,I,O,O)) TNat64 28
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,I,I,O,O)) TInt64 28
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,O,I,O,O,O)) TNat32 27
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,I,I,O,O,O)) TInt32 27
... unused tags .... ... ...
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,O,I,O), (O,O,O,O,O,O,O,O)) TChar 21
... unused tags .... ... ...
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O)) TNat16 16
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (I,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O)) TInt16 16
... unused tags .... ... ...
((_,_,_,_,_,_,_,_), (O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O)) TNat8 8
((_,_,_,_,_,_,_,_), (I,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O)) TInt8 8
... unused tags .... ... ...
((O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O)) TUnit 0
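
One way to read the table is as a list of (mask, pattern) pairs, where `O` and `I` are fixed bits and `_` is payload. The sketch below transcribes the table that way and decodes a raw 32-bit value to its tag and payload width. The `Tag` enum, `TABLE` and `decode` are hypothetical names for illustration, not the compiler's `TaggingScheme` module.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tag {
    Bool, Ref, Num, Nat64, Int64, Nat32, Int32,
    Char, Nat16, Int16, Nat8, Int8, Unit,
}

/// (mask, pattern, tag): a value v has the tag when v & mask == pattern.
/// The mask covers the fixed (O/I) bits of each table row.
const TABLE: [(u32, u32, Tag); 14] = [
    (0xFFFF_FFFF, 0x0000_0000, Tag::Bool), // false
    (0xFFFF_FFFF, 0x0000_0001, Tag::Bool), // true
    (0b11, 0b11, Tag::Ref),
    (0b11, 0b10, Tag::Num),
    (0b1111, 0b0100, Tag::Nat64),
    (0b1111, 0b1100, Tag::Int64),
    (0b1_1111, 0b0_1000, Tag::Nat32),
    (0b1_1111, 0b1_1000, Tag::Int32),
    (0x0000_07FF, 0x0000_0200, Tag::Char),
    (0x0000_FFFF, 0x0000_4000, Tag::Nat16),
    (0x0000_FFFF, 0x0000_C000, Tag::Int16),
    (0x00FF_FFFF, 0x0040_0000, Tag::Nat8),
    (0x00FF_FFFF, 0x00C0_0000, Tag::Int8),
    (0xFFFF_FFFF, 0x4000_0000, Tag::Unit),
];

/// Returns the tag and the number of payload bits, or None for a
/// bit pattern that falls in the unused tag space.
fn decode(v: u32) -> Option<(Tag, u32)> {
    TABLE
        .iter()
        .find(|(mask, pattern, _)| v & mask == *pattern)
        .map(|(mask, _, tag)| (*tag, 32 - mask.count_ones()))
}
```

The payload widths computed from the masks agree with the "Payload bits" column, with the exception of TBool, where every bit is fixed.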

Implementation

The implementation was carried out in a number of precursor PRs:

Overheads

These are the cycle count and code size differences measured using test/bench and test/perf, compared against master (see spreadsheet for perf of interim PRs).

Summarized from:

https://docs.google.com/spreadsheets/d/1zC2Hsl9gGUzJESQmSABPiu-XIsICEw1I3O-JKHNWVQs/edit?usp=sharing

perf

test/perf

| Benchmark | Master | Widening | Widening vs Master | Gated | Gated vs Master |
|---|---|---|---|---|---|
| gas/assetstorage | 10013950 | 10013950 | 0.00% | 10013950 | 0.00% |
| size/assetstorage | 186455 | 186705 | 0.13% | 186520 | 0.03% |
| gas/dao | 4413634512 | 4413744976 | 0.00% | 4413743944 | 0.00% |
| size/dao | 265797 | 266385 | 0.22% | 265922 | 0.05% |
| gas/qr | 1302744688 | 1305067118 | 0.18% | 1302750018 | 0.00% |
| size/qr | 256049 | 256925 | 0.34% | 256285 | 0.09% |
| gas/reversi | 80920993 | 81019001 | 0.12% | 80927129 | 0.01% |
| size/reversi | 175956 | 176421 | 0.26% | 176084 | 0.07% |
| gas/sha224 | 460197621 | 498978947 | 8.43% | | |
| size/sha224 | 191929 | 192859 | 0.48% | | |
| gas/sha256 | 14487063673 | 15568532694 | 7.47% | 14486916565 | 0.00% |
| size/sha256 | 179075 | 180167 | 0.61% | 179223 | 0.08% |

test/bench

| Benchmark | Master | Widening | Widening vs Master | Gated | Gated vs Master |
|---|---|---|---|---|---|
| gas/alloc | 9,243,068,120.00 | 10,350,366,461.00 | 11.98% | 9243068126 | 0.00% |
| size/alloc | 181,066.00 | 180,759.00 | -0.17% | 180464 | -0.33% |
| gas/bignum | 130,604,743.00 | 130,606,013.00 | 0.00% | 130604779 | 0.00% |
| size/bignum | 184,420.00 | 184,093.00 | -0.18% | 183790 | -0.34% |
| gas/heap-32 | 1,610,218,447.00 | 1,695,702,521.00 | 5.31% | 1609469958 | -0.05% |
| size/heap-32 | 182,167.00 | 181,856.00 | -0.17% | 181556 | -0.34% |
| gas/nat16 | 61,393,031.00 | 65,587,813.00 | 6.83% | 61393019 | 0.00% |
| size/nat16 | 181,010.00 | 180,727.00 | -0.16% | 180408 | -0.33% |
| gas/palindrome | 10,131,340.00 | 10,133,866.00 | 0.02% | 10131268 | 0.00% |
| size/palindrome | 185,338.00 | 185,024.00 | -0.17% | 184695 | -0.35% |
| gas/region0-mem | 6,402,149,937.00 | 6,452,495,054.00 | 0.79% | 6402149955 | 0.00% |
| size/region0-mem | 181,898.00 | 181,602.00 | -0.16% | 181281 | -0.34% |
| gas/region-mem | 5,974,331,587.00 | 6,024,676,752.00 | 0.84% | 5974331605 | 0.00% |
| size/region-mem | 181,539.00 | 181,252.00 | -0.16% | 180931 | -0.33% |
| gas/stable-mem | 3,885,566,188.00 | 3,935,898,195.00 | 1.30% | 3885566206 | 0.00% |
| size/stable-mem | 181,896.00 | 181,600.00 | -0.16% | 181279 | -0.34% |
| gas/xxx-nat32 | 57,198,791.00 | 57,199,237.00 | 0.00% | 57198779 | 0.00% |
| size/xxx-nat32 | 181,001.00 | 180,694.00 | -0.17% | 180399 | -0.33% |
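
For readers checking the "vs Master" columns, they appear to be plain relative differences against the master measurement. A small sketch, assuming that definition (the helper name `pct_change` is hypothetical):

```rust
/// Relative difference of a candidate measurement against master, in percent.
fn pct_change(master: f64, candidate: f64) -> f64 {
    (candidate - master) / master * 100.0
}
```

For example, `pct_change(460197621.0, 498978947.0)` reproduces the roughly 8.43% reported for gas/sha224 under "Widening", and `pct_change(181066.0, 180759.0)` the roughly -0.17% for size/alloc.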

@luc-blaeser
Contributor

luc-blaeser commented Feb 15, 2024

I remeasured the performance with the GC benchmark, using the latest tagging changes: https://github.com/luc-blaeser/gcbench

Generally a low to moderate performance regression.

Instructions

Total instructions consumed, using copying GC

Scenario Original Tagged Difference
asset-storage 2.04E+08 2.04E+08 0
blobs 1.52E+10 1.52E+10 0
btree-map 5.35E+10 5.79E+10 +8%
buffer 2.53E+10 2.71E+10 +7%
cancan 4.43E+10 5.05E+10 +14%
extendable-token 3.39E+06 3.41E+06 +1%
game-of-life 2.33E+07 2.33E+07 0
graph 1.92E+10 2.03E+10 +6%
imperative-rb-tree 3.55E+10 3.77E+10 +6%
linked-list 3.75E+10 3.99E+10 +6%
qr-code 1.16E+09 1.16E+09 0
random-maze 2.72E+08 2.92E+08 +7%
rb-tree 1.76E+10 1.87E+10 +6%
reversi 2.59E+07 2.86E+07 +10%
scalable-buffer 6.75E+10 7.20E+10 +7%
sha256 3.58E+10 3.85E+10 +8%
trie-map 3.29E+10 3.54E+10 +8%
Average 2.27E+10 2.44E+10 +7%

Memory Size

No changes since the last measurement. Except for one case (game-of-life, +8%), there is no relevant difference in Wasm allocated memory compared to master.

@crusso
Contributor Author

crusso commented Feb 15, 2024

Excellent - thanks for that!

Contributor

@ggreif ggreif left a comment

Holy shit! This is a gargantuan one. I didn't see anything suspicious, but also my eyesight is not as good as in my youth years :-)

Obviously mandatory unboxing/reboxing makes arg1 ⨶ arg2-style simple arithmetic very expensive. A pity that such simple ones are the rule and not the exception.
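
To make the cost being discussed concrete, here is a hedged sketch of a binary operation on two tagged Nat16 values: both operands are untagged, the operation is performed, and the result is retagged. The tag pattern follows the table in the PR description; the function names are hypothetical and the actual code generator emits Wasm rather than anything resembling this Rust.

```rust
/// Tag field for Nat16, as in the tagging table (payload in the top 16 bits).
const NAT16_TAG: u32 = 0b01 << 14;

fn untag_nat16(v: u32) -> u32 {
    v >> 16
}

fn tag_nat16(x: u32) -> u32 {
    (x << 16) | NAT16_TAG
}

/// Adds two tagged Nat16 values; None stands in for the overflow trap.
fn add_nat16_tagged(a: u32, b: u32) -> Option<u32> {
    let x = untag_nat16(a); // extra step: strip the tag of the first operand
    let y = untag_nat16(b); // extra step: strip the tag of the second operand
    let sum = x + y;        // the actual operation
    if sum > u16::MAX as u32 {
        return None;
    }
    Some(tag_nat16(sum))    // extra step: put the tag back on the result
}
```

In a compound expression the intermediate results can stay untagged, so the overhead is paid per expression rather than per operator.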

I see the point that we need tagging if we want graph-serialisation and 64-bit heap cells. But when will that come?

Anyway I suggest a somewhat extended beta phase to get some experience with this.

let mp_int = p.as_bigint().mp_int_ptr();
mp_get_double(mp_int)
}
debug_assert!(!p.is_scalar());
Contributor

This is a bit surprising on first sight, but understandable if Int and Nat have different tags, Rust cannot untag without context. But they are the same: IO! So why this change?
Is get_signed_scalar dead now?

Contributor Author

The RTS only knows about untagged 31-bit scalars, but our compact bignums are actually 30-bit. The idea here is to shift the switch on compact vs boxed to compile.ml, where the scalar tagging scheme is known. That also makes it easier to change in the future without touching the RTS. So this function should now only be called on proper heap allocated bigints.

} *)
let last = arr.size(e1) : Int - 1 ;
var indx = 0;
if (last == -1) { }
Contributor

< 0 is probably a faster test, as -1 is the only negative outcome. But I see, the operation you feed it in is now EqArrayOffset. Oh well.

Contributor Author

It's only done once. Does it matter? We can always improve later. I just didn't want to ever exceed the range of (now) 30 bit compact bignum.

| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,O)) -> TNum (* 30 bit *)
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,O,I,O,O)) -> TNat64 (* 28 bit *)
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,I,I,O,O)) -> TInt64
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,O,I,O,O,O)) -> TNat32 (* 27 bit *)
Contributor

The utility of this tagging scheme will definitely manifest itself when doing an in-heap upgrade from 32-bit to 64-bit values (or on graph deserialisation, for that matter; Candid deserialisation has no issues in this regard). In the absence of such tags there is no way to know whether a scalar is a Nat8 or a Nat16, but they need different treatment (left shifts).

@@ -2434,13 +2754,13 @@ module TaggedSmallWord = struct
| Type.(Int32|Nat32) -> G.nop
| Type.(Nat8|Nat16) as ty -> compile_shrU_const (shift_of_type ty)
| Type.(Int8|Int16) as ty -> compile_shrS_const (shift_of_type ty)
| Type.Char as ty -> compile_shrU_const (shift_of_type ty)
Contributor

oops, was this missing?

Contributor Author

@crusso crusso Feb 16, 2024

Probably not because we always had and used dedicated untag_codepoint/tag_codepoint functions.

@@ -3170,51 +3513,84 @@ module MakeCompact (Num : BigNumType) : BigNumType = struct
try_unbox I32Type (fun _ -> match n with
| 32 | 64 -> G.i Drop ^^ Bool.lit true
| 8 | 16 ->
compile_bitand_const Int32.(logor 1l (shift_left minus_one (n + 1))) ^^
(* Please review carefully! *)
compile_bitand_const Int32.(logor 1l (shift_left minus_one (n + (32 - BitTagged.ubits_of Type.Int)))) ^^
Contributor

if_tagged_scalar only chooses this fast path when the bottom bit is 0, so the logor 1l is indeed redundant.

_arithmetic_ right shift. This is the right thing to do for signed
numbers, and because we want to apply a consistent logic for all types,
especially as there is only one wasm type, we use the same range for
signed numbers as well.
Contributor Author

@crusso crusso Feb 16, 2024

Was that a typo?

"we use the same range for signed numbers as well." should have read
"we use the same range for unsigned numbers as well."

That always confused me....

Note, we no longer do this except for Nat and Int, allowing an extra bit for unsigned compact numbers (compact Nat32/Nat64).

Co-authored-by: Gabor Greif <gabor@dfinity.org>
@crusso
Contributor Author

crusso commented Feb 16, 2024

> Holy shit! This is a gargantuan one. I didn't see anything suspicious, but also my eyesight is not as good as in my youth years :-)

Thanks for the review!

It was even bigger before I used the typed stackrep to do the tagging and untagging in one place. In that earlier version, each operation had to strip and re-add the tags, which meant many changes.

> Obviously mandatory unboxing/reboxing makes arg1 ⨶ arg2-style simple arithmetic very expensive. A pity that such simple ones are the rule and not the exception.

I guess we could measure that and see the diff. My guess is that it is 3 extra instructions per op, but less for compound expressions.

> I see the point that we need tagging if we want graph-serialisation and 64-bit heap cells. But when will that come?

We could disable it for now, by just changing the TaggingScheme module and using all-zero tags, but then we would need to keep Candid serialization around for the final upgrade to a tagged world. I think @luc-blaeser has already ripped that out in his successor PR adding graph serialization/deserialization, leaving just Candid deserialization for the migration. But I could be wrong.

> Anyway I suggest a somewhat extended beta phase to get some experience with this.

Yes, maybe we should make a release and ask people to try it, without including it in dfx for a while.

@crusso
Contributor Author

crusso commented Feb 16, 2024

@ggreif @luc-blaeser

What should I do about the filecheck tests? I could modify them for the new tagging scheme (but then we can't disable it easily) or just leave as is.

They feel pretty fragile to me.

@crusso
Contributor Author

crusso commented Feb 18, 2024

@ggreif @luc-blaeser I started PR #4400 to toggle the scalar tagging via a flag in case we are nervous. Widening the compact integer ranges back to 31 bits (from 30/28/26 for Nat/Int, Nat64/Int64, Nat32/Int32) revealed what I think was a bug in the if_can_tag_i32/64 tests and sanity checks.

When ubits pty is 31, I think the broken tests would just clear the sign bits and thus always succeed and tag the numbers, even when out of range.
When ubits pty was less than 31, I think it would fail to xor the lowest sign bit, and possibly box more than necessary (not entirely sure, tbh).

I've fixed the bug here 537876b
but would appreciate a careful second look from both of you.
I had to rewrite the sanity checks to handle the case when ubits pty = 31 since the signed comparisons would no longer work for the unsigned range.

I think the bug was benign for correctness, but would box negative numbers that are actually in the compact range, probably even small ones!
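
For reference, a range test of the kind under discussion can be sketched as below. This is not the code from commit 537876b, which is not shown in this thread; it is a generic illustration of checking whether a 32-bit value fits an n-bit signed or unsigned payload, with hypothetical function names.

```rust
/// True if x is representable as a signed two's-complement number of
/// `bits` bits (1..=32): sign-extending the low `bits` bits must give x back.
fn fits_signed(x: i32, bits: u32) -> bool {
    let shift = 32 - bits;
    (x.wrapping_shl(shift) >> shift) == x
}

/// True if x is representable as an unsigned number of `bits` bits:
/// all bits above the payload must be zero.
fn fits_unsigned(x: u32, bits: u32) -> bool {
    bits >= 32 || (x >> bits) == 0
}
```

With a 30-bit signed payload, small negative numbers such as -1 pass the test and stay compact, which is the behaviour the fix is meant to restore. A 31-bit unsigned payload needs the unsigned test, because a signed comparison would misread values with bit 30 set once they are shifted into the top bits.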

* add flag to enable rtti

* fix bugs in can_tag_i32/i64 tests and sanity checks

* adjust test assert on heap size

* update perf numbers

* revert change

* revert test

* optimized clearing of all-zero tags

* update perf numbers
@crusso crusso added automerge-squash When ready, merge (using squash) and removed automerge-squash When ready, merge (using squash) DO-NOT-MERGE build_artifacts Upload moc binary as workflow artifacts labels Feb 23, 2024
@mergify mergify bot merged commit 3f3af73 into master Feb 23, 2024
11 checks passed
@mergify mergify bot deleted the claudio/small-tags-final-untagged-widened branch February 23, 2024 20:56
@mergify mergify bot removed the automerge-squash When ready, merge (using squash) label Feb 23, 2024
luc-blaeser added a commit that referenced this pull request Feb 29, 2024
luc-blaeser added a commit that referenced this pull request Feb 29, 2024