Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup int, uint, uint64, Int, Uint, UInt, UVarint, Uint64, etc. #615

Open
icorderi opened this issue Oct 31, 2019 · 13 comments
Open

Cleanup int, uint, uint64, Int, Uint, UInt, UVarint, Uint64, etc. #615

icorderi opened this issue Oct 31, 2019 · 13 comments

Comments

@icorderi
Copy link

@icorderi icorderi commented Oct 31, 2019

The spec has uppercase, lower case, newtype defs, encoding expectations, size bounds all used indiscriminately in the doc.

We need to finalize expected size bound and encoding for each numerical type.

We proposed the following sets of numbers:

  • Special number types:
    • Varint64 (64-bit number encoded cheaply on low values),
    • UVarint64 (64-bit unsigned number encoded cheaply on low values),
    • BigInt (arbitrary length integer, currently using leb128 encoding)
    • optional [U]Varint{16-32} to complete to the family (this doesn't affect the chain storage but puts memory size bounds and helps advise implementations on expected sizes, i.e. a UVarint64 for a MethodID inside an actor seems a bit much).
  • Primitive number types: uint{8-64}, int{8-64} (this are NOT varlen encoded, LE/BE needs to be specified in the encoding section of the spec)

Note: Varint and UVarint default to 64-bit, but it might be best to be consistent across the board and express numbers with their sizes.

@zixuanzh

This comment has been minimized.

Copy link
Contributor

@zixuanzh zixuanzh commented Oct 31, 2019

@icorderi, yes, this was left as a TODO. Will put up a PR to clean up these types. Your PR will be welcomed too ❤️

@Kubuxu

This comment has been minimized.

Copy link
Contributor

@Kubuxu Kubuxu commented Oct 31, 2019

BigInt (arbitrary length integer, currently using leb128 encoding)

I'm not aware of BigInt being leb128 encoded. UVarint* are leb128 encoded.
Lotus currently encodes BigInt as big-endian bytearray.

@Kubuxu

This comment has been minimized.

Copy link
Contributor

@Kubuxu Kubuxu commented Nov 2, 2019

Another thing is: we are using CBOR, we should follow CBOR encoding rules: https://tools.ietf.org/html/rfc7049#section-2.1 (Major type 0 and Major type 1).

@hannahhoward

This comment has been minimized.

Copy link
Contributor

@hannahhoward hannahhoward commented Nov 2, 2019

^^^ +1 to @Kubuxu 's suggestion -- as I see it, CBOR is the lingau franca of Filecoin data exchange, and we should focus on staying in within rules of type encoding to CBOR, since we frequently then calculate CIDs that are dependent on that encoding.

@jbenet

This comment has been minimized.

Copy link
Member

@jbenet jbenet commented Nov 4, 2019

Thanks for bringing this up. will take a look this week.

(some quick notes)

  • Most numbers will be small in most cases but need to be able to grow (therefore varints)
  • Yes i believe Varints should be LEB128 encoded
  • (Correct me if im wrong but) i believe CBOR major types only go up to uint64_t. not real varints? (some numbers will get bigger than uint64). To use this serialization types in most places, we would either need to restrict to only using numbers that fit in uint64 or define what happens when the number is bigger. (prefer the former)
  • To minimize confusion, we should aim for the least amount of number types (not a fan of having all of [u]int{8, 16, 32, 64} around in code)
  • To minimize chain bandwidth, we should aim for most compressed representations (which i think will be CBOR Major Type 0/1?)

potentially three paths:

  • (1) Varint-first: (most future proofed, less cbor friendly, less number types)
    • use Varint and UVarint for almost everything. Encode them as they need to be in CBOR (LEB128?)
    • use BigInt where necessary (probably leb128 encoded)
  • (2) CBOR-ints first: (less future proofed, more number types)
    • Use all 8 [u]int{8,16,32,64} int types for most data structures -- leverage CBOR encoding
    • Use UVarint/Varint where numbers need to grow arbitrarily large
    • Use BigInt where necessary (probably leb128 encoded)
  • (3) CBOR-ints first -- simpler: (hybrid of the two -- CBOR first, but using Varint types in spec code)
    • use Varint64 and UVarint64 for almost everything. Encode them as CBOR ints
    • DO NOT use [u]int{8,16,32,64} in code -- let serialization use it in CBOR.
    • use BigInt where necessary (probably leb128 encoded), including numbers that need to grow arbitrarily large

(cc @jzimmerman @whyrusleeping who both likely care about this)

@Kubuxu

This comment has been minimized.

Copy link
Contributor

@Kubuxu Kubuxu commented Nov 4, 2019

(Var)ints have custom encoding in CBOR (range: space needed):

  • [0, 23] : 0 - tag + int share space
  • [24, 2^8): 1
  • [2^8, 2^16): 2
  • [2^16, 2^32): 4
  • [2^32, 2^64): 8

Negative numbers have similar ranges.

BigInts are length prefix encoded in CBOR.
So:

  • 0: 1 - tag + length=0, means it is zero
  • rest: (Var)int encoding of byte length + log2(x)/8 (big endian encoding of bigint).
@jzimmerman

This comment has been minimized.

Copy link
Collaborator

@jzimmerman jzimmerman commented Nov 4, 2019

I think it might be clearer to separate the question of in-memory representation (and hence the types listed in the fields) from the question of serialization:

  • For the in-memory representation, I tend to agree that in order to minimize confusion, the fewer types we need, the better. I think it would be reasonable to have just two: Int (meaning native 64-bit integer), and BigInt (meaning arbitrary length).

    • It may or may not make sense to also have unsigned versions (i.e., UInt/UBigInt in this case). Personally, I would be inclined to stick with just the signed versions, but if we do have unsigned versions, I think we should be very clear on the meaning of "unsigned": is it just a form of documentation (i.e., "we expect this value will be nonnegative but do not guarantee it"), or does it have tangible effects (i.e., is 1 - 2 < 0, is 1 - 2 > 0, or is 1 - 2 an error?)
  • For the serialized representation, I would think CBOR integer representation (as detailed by @Kubuxu above) should work well. Using LEB128 directly would be slightly more space-efficient in certain regimes, presumably (e.g., small integers between 24 and 128 would require one byte instead of two), but at the cost of not being CBOR-compatible; not clear if this is worth the savings.

@hannahhoward

This comment has been minimized.

Copy link
Contributor

@hannahhoward hannahhoward commented Nov 5, 2019

+1 for @jzimmerman 's approach.

@dignifiedquire

This comment has been minimized.

Copy link
Member

@dignifiedquire dignifiedquire commented Nov 5, 2019

@jzimmerman

This comment has been minimized.

Copy link
Collaborator

@jzimmerman jzimmerman commented Nov 5, 2019

@dignifiedquire Could that happen at the API level? I.e., in standard CPU code outside of proofs, the values are of type Int (64-bit), but we perform a check that they fit within the appropriate bit width before generating the proof, and the generation/verification process errors out if not?

@dignifiedquire

This comment has been minimized.

Copy link
Member

@dignifiedquire dignifiedquire commented Nov 5, 2019

@dignifiedquire

This comment has been minimized.

Copy link
Member

@dignifiedquire dignifiedquire commented Nov 5, 2019

@jzimmerman

This comment has been minimized.

Copy link
Collaborator

@jzimmerman jzimmerman commented Nov 5, 2019

Ah, I see -- yes, I don't think there's an issue with having these distinct integer types internal to the algorithm implementations; the concern was mainly regarding the data structures that need to be passed between APIs and/or go on chain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.