feat: precise tagging of scalar values #4369
Conversation
I remeasured the performance in the GC benchmark with the latest tagging changes: https://github.com/luc-blaeser/gcbench. Generally low or moderate performance regression.

Instructions: total instructions consumed, using copying GC.

Memory size: no changes since the last measurement, except for one case.

Excellent - thanks for that!
Holy shit! This is a gargantuan one. I didn't see anything suspicious, but also my eyesight is not as good as in my younger years :-)
Obviously mandatory unboxing/reboxing makes `arg1 ⨶ arg2`-style simple arithmetic very expensive. A pity that such simple ones are the rule and not the exception.
I see the point that we need tagging if we want graph-serialisation and 64-bit heap cells. But when will that come?
Anyway I suggest a somewhat extended beta phase to get some experience with this.
let mp_int = p.as_bigint().mp_int_ptr();
mp_get_double(mp_int)
}
debug_assert!(!p.is_scalar());
This is a bit surprising at first sight, but understandable: if `Int` and `Nat` have different tags, Rust cannot untag without context. But they are the same: `IO`! So why this change?

Is `get_signed_scalar` dead now?
The RTS only knows about untagged 31-bit scalars, but our compact bignums are actually 30-bit. The idea here is to shift the switch on compact vs boxed to compile.ml, where the scalar tagging scheme is known. That also makes it easier to change in the future without touching the RTS. So this function should now only be called on proper heap-allocated bigints.
} *)
let last = arr.size(e1) : Int - 1 ;
var indx = 0;
if (last == -1) { }
`< 0` is probably a faster test, as `-1` is the only negative outcome. But I see, the operation you feed it into is now `EqArrayOffset`. Oh well.
It's only done once. Does it matter? We can always improve later. I just didn't want to ever exceed the range of the (now) 30-bit compact bignums.
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,O)) -> TNum (* 30 bit *)
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,O,I,O,O)) -> TNat64 (* 28 bit *)
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,I,I,O,O)) -> TInt64
| ((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,O,I,O,O,O)) -> TNat32 (* 27 bit *)
The utility of this tagging scheme will definitely manifest itself when doing in-heap upgrades from 32- to 64-bit values (or on graph deserialisation for that matter; but Candid deserialisation has no issues in this regard). In the absence of such tags there is no way to know if a scalar is `Nat8` or `Nat16`, but they need different treatment (left shifts).
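The type-dependent treatment can be illustrated with a minimal sketch, assuming (as described elsewhere in this PR) that small-word payloads sit in the most significant bits and untagging is a right shift whose width depends on the type. The helper names and shift amounts are illustrative, not the compiler's actual `shift_of_type`:

```rust
// Hypothetical untagging helpers for MSB-resident small-word payloads.
// Assumption: an 8-bit payload leaves 24 tag bits below it, a 16-bit
// payload leaves 16; signed types need an arithmetic (sign-extending) shift.

fn untag_nat8(v: u32) -> u8 {
    (v >> 24) as u8 // logical shift for unsigned types
}

fn untag_nat16(v: u32) -> u16 {
    (v >> 16) as u16
}

fn untag_int16(v: u32) -> i16 {
    ((v as i32) >> 16) as i16 // arithmetic shift preserves the sign
}

fn main() {
    assert_eq!(untag_nat8(0xAB00_0000), 0xAB);
    assert_eq!(untag_nat16(0xBEEF_0000), 0xBEEF);
    assert_eq!(untag_int16(0x8000_0000), -32768i16);
}
```

Mistaking a `Nat8` for a `Nat16` would apply the wrong shift and corrupt the payload, which is why precise tags matter here.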
@@ -2434,13 +2754,13 @@ module TaggedSmallWord = struct
| Type.(Int32|Nat32) -> G.nop
| Type.(Nat8|Nat16) as ty -> compile_shrU_const (shift_of_type ty)
| Type.(Int8|Int16) as ty -> compile_shrS_const (shift_of_type ty)
| Type.Char as ty -> compile_shrU_const (shift_of_type ty)
oops, was this missing?
Probably not, because we always had and used dedicated untag_codepoint/tag_codepoint functions.
src/codegen/compile.ml
Outdated
@@ -3170,51 +3513,84 @@ module MakeCompact (Num : BigNumType) : BigNumType = struct
try_unbox I32Type (fun _ -> match n with
| 32 | 64 -> G.i Drop ^^ Bool.lit true
| 8 | 16 ->
compile_bitand_const Int32.(logor 1l (shift_left minus_one (n + 1))) ^^
(* Please review carefully! *)
compile_bitand_const Int32.(logor 1l (shift_left minus_one (n + (32 - BitTagged.ubits_of Type.Int)))) ^^
`if_tagged_scalar` only chooses this fast path when the bottom bit is 0, so the `logor 1l` is indeed redundant.
_arithmetic_ right shift. This is the right thing to do for signed
numbers, and because we want to apply a consistent logic for all types,
especially as there is only one wasm type, we use the same range for
signed numbers as well.
Was that a typo?

"we use the same range for signed numbers as well." should have read
"we use the same range for unsigned numbers as well."

That always confused me...

No, we no longer do this except for Nat and Int, allowing an extra bit for unsigned compact numbers (compact Nat32/Nat64).
Co-authored-by: Gabor Greif <gabor@dfinity.org>
Thanks for the review! It was even bigger before I used the typed stackrep to do the tagging and untagging in one place... Before that, each operation had to strip and add the tags, which meant many changes...

I guess we could measure that and see the diff. My guess is that it is 3 extra instructions per op, but less for compound expressions.

We could disable it for now, by just changing the TaggingScheme module and using all-zero tags, but then we would need to keep candid serialization around for the final upgrade to a tagged world. I think @luc-blaeser has already ripped that out in his successor PR adding graph serialization/deserialization, leaving just candid deserialization for the migration. But I could be wrong.
Yes, maybe we should make a release and ask people to try it, without including it in dfx for a while.

What should I do about the filecheck tests? I could modify them for the new tagging scheme (but then we can't disable it easily) or just leave them as is. They feel pretty fragile to me.
@ggreif @luc-blaeser I started PR #4400 to toggle the scalar tagging via a flag, in case we are nervous, widening the compact integer ranges back to 31 bits (from 30/28/26 for

When I've fixed the bug here: 537876b

I think the bug was benign for correctness, but would box negative numbers that are actually in the compact range, probably even small ones!
* add flag to enable rtti
* fix bugs in can_tag_i32/i64 tests and sanity checks
* adjust test assert on heap size
* update perf numbers
* revert change
* revert test
* optimized clearing of all-zero tags
* update perf numbers
The Motoko runtime representation of values is largely untyped, distinguishing only between scalar and boxed values
using a single bit of the 32-bit value representation. The tagging is only to support garbage collection, not precise runtime type information.
In the existing value encoding, a Motoko value in vanilla form is a 32-bit value that is either:

- a scalar, with LSB `0b0`, or
- a pointer, with LSB `0b1`, encoded by subtracting 1 from the pointer value (ensuring the 2 LSBs are `0b11`), pointing to a heap allocated value.
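The existing single-bit scheme can be sketched as follows, assuming the layout described above (LSB 0 marks a scalar; pointers are stored skewed, i.e. the pointer value minus 1, so a 4-byte aligned address gets 2 LSBs of `0b11`). The helper names are illustrative, not the actual RTS API:

```rust
// Sketch of the pre-PR 1-bit scalar/pointer discrimination.

fn is_scalar(v: u32) -> bool {
    v & 0b1 == 0 // LSB 0 = scalar, LSB 1 = (skewed) pointer
}

fn scalar_payload(v: u32) -> u32 {
    v >> 1 // 31-bit payload above the single tag bit
}

fn skew(ptr: u32) -> u32 {
    ptr.wrapping_sub(1) // aligned pointer - 1: the 2 LSBs become 0b11
}

fn unskew(v: u32) -> u32 {
    debug_assert!(v & 0b11 == 0b11);
    v.wrapping_add(1)
}

fn main() {
    let v = 42u32 << 1; // the scalar 42, tagged
    assert!(is_scalar(v));
    assert_eq!(scalar_payload(v), 42);

    let p = skew(0x1000); // a 4-byte aligned heap address
    assert!(!is_scalar(p));
    assert_eq!(unskew(p), 0x1000);
}
```

Note that this single bit only distinguishes scalar from boxed; it says nothing about the scalar's Motoko type, which is what this PR adds.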
Scalar values encode `Nat8/16` and `Int8/16` values and chars, and 31-bit subranges of `Nat32`, `Int32`, `Nat64`, `Int64`, `Nat` and `Int`. Large integer values that don't fit in a 31-bit scalar are boxed on the heap.

Observe that, in Motoko, some types are always scalar (e.g. `Nat8`), some types are always boxed (e.g. `Blob`), and some types have a mixed scalar/boxed representation (e.g. `Nat32` and `Nat`), depending on the size of the value.

This PR adds exact runtime type information to all[*] scalar values, making the scalar values self-describing.
Making the entire heap fully self-describing requires refining the heap tags used to identify heap objects, distinguishing boxed `Nat32` from boxed `Int32`, `Blob` from `Principal` and `Text`, tuples from (mutable and immutable) arrays, etc. That work of refining heap tags will need to be completed in a follow-on or sibling PR, but is hopefully less involved than the changes herein.

To add precise scalar type info, we extend the scalar tagging scheme with a richer set of (inline) type descriptors, using some of the least significant bits of the 31-bit scalar representation.
To avoid dedicating a fixed-length suffix (say 1 byte) to the scalar tag, scalar tags are actually variable-length, using shorter tags for larger payload types, and longer tags for shorter payload types. This gives us a reasonable tag space (set of possible tags, some still unused), without reducing the scalar range of mixed-representation types too much.
At one extreme, the tag of `Int` (and `Nat`) is just `0b10`, leaving a 30-bit payload for compact `Nat/Int`, losing just 1 bit from the current representation's 31-bit compact range. This is important because `Int`s are common, and `Nat`s are used to index arrays, so we should avoid boxing more than necessary.

In the middle, the tags of `Nat16` and `Int16` are `0b10(0^12)00` and `0b11(0^12)00`, leaving a 16-bit payload in the MSBs.

At the other extreme, the tag of the unit value, `()`, is the 32-bit `0b01(0^28)00`, occupying the entire value.
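The two extremes above can be sketched concretely, assuming the stated bit patterns: compact `Int`/`Nat` carry the 2-bit tag `0b10` in the LSBs with a 30-bit payload above it, while `Nat16` carries the 16-bit tag `0b10(0^12)00` (i.e. `0x8000`) with its 16-bit payload in the MSBs. Helper names are hypothetical, not the compiler's:

```rust
// Variable-length scalar tags: a short tag for the common Int/Nat case,
// a long tag for the small Nat16 payload.

const NUM_TAG: u32 = 0b10; // 2-bit tag shared by compact Int and Nat

fn tag_int(payload: i32) -> u32 {
    // Assumes the payload fits in 30 signed bits.
    ((payload as u32) << 2) | NUM_TAG
}

fn untag_int(v: u32) -> i32 {
    (v as i32) >> 2 // arithmetic shift restores the sign
}

const NAT16_TAG: u32 = 0x8000; // 0b10 followed by fourteen 0 bits

fn tag_nat16(payload: u16) -> u32 {
    ((payload as u32) << 16) | NAT16_TAG
}

fn untag_nat16(v: u32) -> u16 {
    (v >> 16) as u16
}

fn main() {
    assert_eq!(untag_int(tag_int(-7)), -7); // sign survives the round trip
    assert_eq!(tag_int(1) & 0b11, 0b10);    // short tag in the 2 LSBs
    assert_eq!(untag_nat16(tag_nat16(0xBEEF)), 0xBEEF);
    assert_eq!(tag_nat16(0) & 0xFFFF, NAT16_TAG); // long tag fills the LSBs
}
```

The trade-off is visible in the shifts: the common `Int`/`Nat` case pays only 2 bits of range, while the already-small `Nat16` can afford a 16-bit tag.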
, occupying the entire value.The primary motivation of this work is to support value, not type driven, serialization of stable values to a precisely typed stable format, without loss of type information, so that upgrades can still accommodate type dependent changes of representation from one in-memory format to another. Secondary motivations are live and post-mortem heap inspection tools and light-weight debugging tools, that can parse values in locals, arguments and on the heap using tags.
[*] There remain some raw, untagged 31-bit scalars whose type is only known to the compiler. These are used to encode the state of text and blob iterators, hidden in dedicated iterator closure environments. Note that these are not stable types, so need not be precisely tagged for stabilization.
Tagging Scheme
((O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))
((O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,I))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,I))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,O,I,O,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,I,I,O,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,O,I,O,O,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,I,I,O,O,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,O,I,O), (O,O,O,O,O,O,O,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))
((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (I,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))
((_,_,_,_,_,_,_,_), (O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))
((_,_,_,_,_,_,_,_), (I,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))
((O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))
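Because the tags are variable-length, classifying a scalar means testing progressively longer suffixes of its LSBs. A minimal sketch, using only the four patterns that appear labelled in the review snippet above (`TNum` = `0b10`, `TNat64` = `0b0100`, `TInt64` = `0b1100`, `TNat32` = `0b01000`); the enum and function are hypothetical, not the compiler's:

```rust
// Decode a variable-length scalar tag by checking short suffixes first.

#[derive(Debug, PartialEq)]
enum Tag {
    TNum,   // compact Int/Nat, 30-bit payload
    TNat64, // compact Nat64, 28-bit payload
    TInt64, // compact Int64
    TNat32, // compact Nat32, 27-bit payload
    Other,  // remaining (longer) tags, not modelled here
}

fn classify(v: u32) -> Tag {
    if v & 0b11 == 0b10 {
        Tag::TNum
    } else if v & 0b1111 == 0b0100 {
        Tag::TNat64
    } else if v & 0b1111 == 0b1100 {
        Tag::TInt64
    } else if v & 0b1_1111 == 0b0_1000 {
        Tag::TNat32
    } else {
        Tag::Other
    }
}

fn main() {
    assert_eq!(classify(0b10), Tag::TNum);
    assert_eq!(classify(0b0100), Tag::TNat64);
    assert_eq!(classify(0b1100), Tag::TInt64);
    assert_eq!(classify(0b01000), Tag::TNat32);
}
```

The prefix-free structure of the tags is what makes this well-defined: no shorter tag is a suffix of a longer one, so the first match is the only match.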
Implementation
The implementation was carried out in a number of precursor PRs:
`Nat32/Int32` and `Nat32/Nat64`, making the payload size type-dependent. The previously untyped StackReps `UnboxedWord32` and `UnboxedWord64` were extended to carry a type argument. The argument is used to remember and re-introduce the precise tag on unboxing and boxing. It can also be used to verify the tag on unboxing, for sanity checking.
Widened compact `Int/Nat` from 29 to 30 bits, by adjusting the tagging scheme. This is just 1 bit less than with the existing scheme (31-bit, untagged scalars).
Used the `UnboxedWord32/Word64` stack reps also for untagged, 0-right-padded small tagged values, tagging/untagging only on exit to and from the stack.
This alone reduces the (large) 80% overhead in bench/nat16.mo to 55%. It also has the advantage of reverting almost all changes to the arithmetic code, which can now (again) assume values are right 0-padded, as it did previously.

Extended the existing optimization done for mutable locals containing unboxed `Nat32`/`Int32` and `Int64`/`Nat64`. This reduces the `bench/nat16.mo` overhead from 55% to just 6% (the benchmark uses repeated in-place updates in a tight loop, so benefits greatly).

This PR also makes use of the previously unused bit in the compact representation of `Nat32`s and `Nat64`s, which previously had to concur with the representation of `Int32` and `Int64` and could only represent half the unsigned range. With the typed StackRep, we now know whether the values are signed or not and can choose distinct compact representations for `Nat32` vs `Int32`, and `Nat64` vs `Int64`, rather than shared ones.

Note however, that the compact representation for `Nat` cannot recover the missing bit because of subtyping. A compact `Nat` must have the same representation as a compact `Int` to support non-coercive subtyping.
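The subtyping constraint can be made concrete with a small sketch, assuming compact `Nat` and compact `Int` share the tag `0b10` as described above: because `Nat <: Int`, a tagged compact `Nat` must already be a valid tagged compact `Int`, so the coercion is the identity on the bit pattern. Helper names are illustrative:

```rust
// Non-coercive subtyping: compact Nat and compact Int share one tag.

const NUM_TAG: u32 = 0b10; // shared by compact Nat and compact Int

fn tag_num(payload: i32) -> u32 {
    // Assumes the payload fits in 30 signed bits.
    ((payload as u32) << 2) | NUM_TAG
}

fn untag_int(v: u32) -> i32 {
    (v as i32) >> 2 // arithmetic shift, valid for both Nat and Int payloads
}

fn main() {
    // A compact Nat passed where an Int is expected:
    let nat_value = tag_num(1234);
    // No coercion code runs; the same bits read back correctly as an Int.
    assert_eq!(untag_int(nat_value), 1234);
    assert_eq!(nat_value & 0b11, NUM_TAG);
}
```

This is exactly why `Nat` cannot reclaim the extra unsigned bit that `Nat32`/`Nat64` gain from their now-distinct tags: giving `Nat` its own representation would force a coercion at every `Nat`-to-`Int` subtyping boundary.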
to support non-coercive subtyping.Mo_config.Flags.rtti (default off)
, avoiding overhead for now.--experimental-rtti
to enable feature for performance feedback from users.Overheads
These are the cycle count and code size differences measured using `test/bench` and `test/perf`, compared against master (see spreadsheet for perf of interim PRs). Summarized from:
https://docs.google.com/spreadsheets/d/1zC2Hsl9gGUzJESQmSABPiu-XIsICEw1I3O-JKHNWVQs/edit?usp=sharing
Results tables for `perf`, `test/perf` and `test/bench` are in the spreadsheet above.