Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quo vadis blob #966

Open
nomeata opened this issue Dec 2, 2019 · 25 comments
Open

Quo vadis blob #966

nomeata opened this issue Dec 2, 2019 · 25 comments
Assignees
Labels
feature New feature or request language design Requires design work library Base or other libraries P1 high priority, resolve before the next milestone

Comments

@nomeata
Copy link
Collaborator

nomeata commented Dec 2, 2019

In #963 I am introducing Blob type. This is initially just scaffolding to not have #789 blocked on the question of whether we want a (more) abstract Entity/Principal/Caller/Canister/User type, and be able to write some code.

Presumably, the caller will turn into some kind of abstract type, possibly with conversions to and from Text and/or Blob.

But what about Blob itself? I see a few directions this can take:

  1. We keep Blob as a primitive type, just like Text, and add operations (various conversion to and from text, maybe even indexing) as needed.
  2. We define type Blob = [Word8], and live with the size penalty due to the uniform representation.
  3. We define type Blob = [Word8], but change the representation to pack arrays of 8, 16, 32 and 64bit sized types upon construction, with separate heap object types for these. All array access would dynamically dispatch as needed. We’d be trading execution speed for space usage (but may be a good deal on our platform).

With 2 and 3 there is the problem that we would lose the equality and comparison operators, which are really kinda useful.

Somewhat independent there is the question if we want blob literal syntax.

@nomeata nomeata changed the title Quot valid blob Quot vadis blob Dec 2, 2019
@nomeata nomeata changed the title Quot vadis blob Quo vadis blob Dec 2, 2019
@nomeata nomeata added the feature New feature or request label Dec 11, 2019
@nomeata
Copy link
Collaborator Author

nomeata commented Dec 11, 2019

BTW, I my preference is 3.

If we also pack arrays of 0-bit-sized elements we may even use this to gracefully deal with an IDL bomb of a large vec null or vec any array.

@ggreif
Copy link
Contributor

ggreif commented Jan 9, 2020

@eftychis @enzoh @matthewhammer You folks asked for Blob inspection routines. Here is my current plan on providing you functionality. Please chime in if you need something that is not implementable in user space using the below.

Since we don't have Blob literals, I am hesitant to include Blob into debug_show, but I am open for arguments (with formatting suggestion).

@chenyan-dfinity
Copy link
Contributor

If only for debugging, we can return Blob and inspect the IDL bytes.

@ggreif
Copy link
Contributor

ggreif commented Jan 12, 2020

More elimination forms are defined in #1100. What intro forms do we need? From [Word8] and [var Word8] perhaps? Text? Probably not.

@nomeata
Copy link
Collaborator Author

nomeata commented Jan 12, 2020

[Word8] is the natural one

@crusso
Copy link
Contributor

crusso commented Jan 13, 2020

Doesn't option 3 need type passsing? What's the representation for [T] and what happens when you instantiate T at Word8 etc? @nomeata

@nomeata
Copy link
Collaborator Author

nomeata commented Jan 13, 2020

Option 3 would be different heap representation (ArrayBoxed, ArrayWord8, ArrayWord16, ArrayWord32, ArrayWord64), with different heap object tags, and a dynamic dispatch on every array operation (no matter what the static type). Not quite type passing.

@crusso
Copy link
Contributor

crusso commented Jan 13, 2020

That could work, but not sure it fits our KISS (Keep it slow stupid) motto.

@nomeata
Copy link
Collaborator Author

nomeata commented Jan 13, 2020

I think we are fine: It’s also slow in its own way, due to the branch on the tag

@nomeata
Copy link
Collaborator Author

nomeata commented Apr 16, 2020

Should we revise that discussion? It irks me that Blob doesn’t round-trip through Candid: We export Blob as [nat8], but import [nat8] as [Nat8], which is quite expensive.

Oh, but there is

With 2 and 3 there is the problem that we would lose the equality and comparison operators, which are really kinda useful.

Maybe we need quality and comparison on arrays? :-)

@chenyan-dfinity
Copy link
Contributor

Now that we have Principal, what's the real use case for Blob? I think we will need an intro form for Principal. This makes testing easy, and we can already round-trip Principal.

@nomeata
Copy link
Collaborator Author

nomeata commented Apr 16, 2020

Yes, we have

func blobOfPrincipal(id : Principal) : Blob
func principalOfActor(act : actor {}) : Principal

in Prim. So you can already turn a blob into a principal using principalOfActor(actor (textual_representation(blob)) if you implement the textual representation… so no reason to not add principalOfBlob directly.

what's the real use case for Blob?

Binary data of all sorts, in compact form? But yes, we should just make [Nat8] usable, right?

@chenyan-dfinity
Copy link
Contributor

chenyan-dfinity commented Apr 16, 2020

you can already turn a blob into a principal using principalOfActor(actor (textual_representation(blob))

Nice! I didn't realize this implication when you add principalOfActor :)

Binary data of all sorts

I think [Nat8] doesn't necessarily mean binary data, so we need a dedicated Blob type. But this argument feels nominal then.

@nomeata
Copy link
Collaborator Author

nomeata commented Apr 20, 2020

With 2 and 3 there is the problem that we would lose the equality and comparison operators, which are really kinda useful.

Maybe we need quality and comparison on arrays? :-)

Andreas points out that this can be Array.eq in the stdlib, so maybe not blocked on this.

@rossberg rossberg added library Base or other libraries language design Requires design work P2 medium priority, resolve within a couple of milestones labels Apr 23, 2020
@ghost ghost added P1 high priority, resolve before the next milestone and removed P2 medium priority, resolve within a couple of milestones labels May 5, 2020
@nomeata
Copy link
Collaborator Author

nomeata commented Jun 7, 2020

@rossberg, as the language design lead, can you make a judgment call?

@rossberg
Copy link
Contributor

rossberg commented Jun 8, 2020

If it wasn't for the Candid mapping, then the conservative right now would still be to keep Blob its own type, separate from arrays. Unifying them later would only make more programs type-check.

However, if we map the Blob type to Candid blob (as seems logical), then we won't be able to do that change anymore without breaking existing canisters.

Yet, the simplest and most efficient solution is to make Blob its own type. Not the absolute purism, but I began to think that it's the right pragmatic choice. So unless anybody objects, I suggest leaving it at that.

FWIW, somewhat tangential, I started growing skeptical of having regular equality extend to types like Text or Blob, where it's not constant time. I guess it's just too damn convenient for Text, but is it ever advisable for blobs?

Edit: Fixed some confusing typos.

@nomeata
Copy link
Collaborator Author

nomeata commented Jun 8, 2020

However, if we map the Blob type to Candid blob (as seems logical), then we won't be able to do that change anymore without breaking existing canisters.

Why? In Candid, blob is just a shorthand for vec nat8. So we can still do that without breaking canisters.

So unless anybody objects, I suggest leaving it at that.

Works for me. This means we have to add indexing? What else?

I guess it's just too damn convenient for Text, but is it ever advisable for blobs?

Not sure if it makes sense to distinguish the two. I don’t think that Blobs are inherently larger than Texts, or something like that.

@rossberg
Copy link
Contributor

rossberg commented Jun 9, 2020

Why? In Candid, blob is just a shorthand for vec nat8.

Ah right, thinko. So even more reason to leave it as is.

This means we have to add indexing? What else?

We could add the same pseudo methods as for arrays (get, keys), but that does not seem urgent.

I guess it's just too damn convenient for Text, but is it ever advisable for blobs?

Not sure if it makes sense to distinguish the two. I don’t think that Blobs are inherently larger than Texts, or something like that.

While there may be long instances of both, equality on strings is typically applied only to short fragments, like individual words or names. There is no obvious analogue for blobs.

@nomeata
Copy link
Collaborator Author

nomeata commented Jun 9, 2020

We could add the same pseudo methods as for arrays (get, keys), but that does not seem urgent.

We already have blob.bytes()

While there may be long instances of both, equality on strings is typically applied only to short fragments, like individual words or names. There is no obvious analogue for blobs.

Sure there are: comparing hashes would be a pretty common example.

Sequences of bytes and sequences of unicode characters – seems quite similar to me.

@rossberg
Copy link
Contributor

rossberg commented Jun 9, 2020

We already have blob.bytes()

Yes, but that corresponds to vals, not keys. But as I said, it's not urgent.

Sure there are: comparing hashes would be a pretty common example.

You mean hashes that are represented as blobs? Fair enough, though it would be more natural to represent these as big numbers.

@rossberg
Copy link
Contributor

@nomeata, do you still have appetite for a non-uniform array representation, or should we close this?

@nomeata
Copy link
Collaborator Author

nomeata commented Oct 27, 2020

I still have appetite at having a more coherent plan for Blob, and I wouldn’t mind the non-uniform array representation if that’s part of the solution (and since that would also benefit people who use [Nat64] etc.).

@nomeata
Copy link
Collaborator Author

nomeata commented Oct 27, 2020

Another direction would be to think of Blob less of a variant of [Nat8], but rather a variant of Text (with different “characters”), i.e. no indexing, maybe cheap concatentation, iterators. No strong convictions here at the moment.

nomeata added a commit that referenced this issue Apr 8, 2021
until we have the full story for where we go with `Blob` (#966), we
should at least provide sufficient ways to create blobs, as requested,
for example, in dfinity/motoko-base#242.

This adds
```
func blobToArray(b : Blob) : [Nat8]
func blobToArrayMut(b : Blob) : [var Nat8]
func arrayToBlob(a : [Nat8]) : Blob
func arrayMutToBlob(a : [var Nat8]) : Blob
```
to `Prim`, and I plan to expose them in base as `Blob.ofArray(Mut)` and
`Blob.toArray(Mut)`.

Performance is bad right now, with all the copying (and bounds checks in
`Arr.idx`…), but that gives us some low hanging fruit for later.
nomeata added a commit that referenced this issue Apr 8, 2021
until we have the full story for where we go with `Blob` (#966), we
should at least provide sufficient ways to create blobs, as requested,
for example, in dfinity/motoko-base#242.

This adds
```
func blobToArray(b : Blob) : [Nat8]
func blobToArrayMut(b : Blob) : [var Nat8]
func arrayToBlob(a : [Nat8]) : Blob
func arrayMutToBlob(a : [var Nat8]) : Blob
```
to `Prim`, and I plan to expose them in base as `Blob.ofArray(Mut)` and
`Blob.toArray(Mut)`.

Performance is bad right now, with all the copying (and bounds checks in
`Arr.idx`…), but that gives us some low hanging fruit for later.
@skilesare
Copy link

I'd still love concatenation. Current use case from the IC Spec:

Representation-independent hashing of structured data

Structured data, such as (recursive) maps, are authenticated by signing a representation-independent hash of the data. This hash is computed as follows (using SHA256 in the steps below):

For each field that is present in the map (i.e. omitted optional fields are indeed omitted):

concatenate the hash of the field's name (in ascii-encoding, without terminal \x00) and the hash of the value (with the encoding specified below).
Sort these concatenations from low to high

Concatenate the sorted elements, and hash the result.

The resulting hash of 256 bits (32 bytes) is the representation-independent hash.

The following encodings of field values as blobs are used

Binary blobs (canister_id, arg, nonce, module) are used as-is.

Strings (request_type, method_name) are encoded in UTF-8, without a terminal \x00.

Natural numbers (compute_allocation, memory_allocation, ingress_expiry) are encoded using the shortest form Unsigned LEB128 encoding. For example, 0 should be encoded as a single zero byte [0x00] and 624485 should be encoded as byte sequence [0xE5, 0x8E, 0x26].

Arrays (paths) are encoded as the concatenation of the hashes of the encodings of the array elements.

Maps (sender_delegation) are encoded by recursively computing the representation-independent hash.

@rossberg
Copy link
Contributor

rossberg commented Nov 2, 2022

@skilesare, you never want concatenation for any use case where you have to do it repeatedly, since that results in quadratic cost. I think what you'd want is a Buffer-like thing to construct Blobs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request language design Requires design work library Base or other libraries P1 high priority, resolve before the next milestone
Projects
None yet
Development

No branches or pull requests

6 participants