Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify bitfield format; size, ordering, etc. #54

Open
rvagg opened this issue Jul 22, 2020 · 7 comments
Open

Simplify bitfield format; size, ordering, etc. #54

rvagg opened this issue Jul 22, 2020 · 7 comments

Comments

@rvagg
Copy link
Member

rvagg commented Jul 22, 2020

In ipld/specs#282 I'm claiming that the map (bitfield) is addressed here in the same way as the specification, and therefore the reference implementation for the IPLD HashMap. The spec currently doesn't use super clear language about how you address a specific bit of a byte array but it's assuming standard LE addressing where "reading index bit of the byte array" implies a certain ordering that you can apply masks to. I don't know how the Go bitfield works but it would be good to verify if the bit checking used here matches operations on the same byte array as used here https://github.com/rvagg/iamap/blob/master/bit-utils.js and if not, document exactly how it works so implementers can be crystal clear.

@rvagg rvagg added the need/triage Needs initial labeling and prioritization label Jul 22, 2020
@rvagg
Copy link
Member Author

rvagg commented Jul 22, 2020

/cc @warpfork

@rvagg rvagg mentioned this issue Jul 22, 2020
@rvagg rvagg changed the title TODO: verify byte ordering of the bitfield matches spec TODO: clarify bitfield format; size, ordering, etc. Aug 7, 2020
@rvagg
Copy link
Member Author

rvagg commented Aug 7, 2020

Changed the title of this after playing a bit with the format. Dropping some thoughts here as I explore this.

The bitfield at the moment uses a big.Int which can spit out a type of big-endian format. (some of the guts are in https://golang.org/src/math/big/nat.go with the API in https://golang.org/src/math/big/int.go). We're using the Bytes() and SetBytes() methods for serialization and deserialization. What this seems to give us is the most compact big-endian representation of the number it holds. We're using SetBit() to set individual bits on and off to indicate the presence or absence of an element in our data array, so the number is arbitrary, it's the bits that matter.

The maximum length of the bitfield should be 2^bitWidth to hold enough bits to cover all the indexing we need for any node. So a bitWidth of 8 gives us 256 bits needed to store our bitmap data. So if we were to turn on all the bits because we have a full data array, we'd end up serializing 0xff... for 32-bytes, i.e. 256 1's. But if we only tinker with the first 8 bits, then we only need to serialize one byte. e.g. if we only had an element at index 0 then our bitfield would be a single byte, 0x01, but if we only set the 8th bit then we need to bytes so would serialize 0x0100, and so on.

Filecoin only uses a bitWidth of 5, so that's 32-bits, or 4-bytes needed to represent the bitfield.

Some thoughts about this format:

  • It's somewhat Go specific. It's convenient to get these from a big.Int and set them to a big.Int but it's going to be slightly annoying for everyone else unless they have something already that works in exactly the same way. The ideal internal representation is for a node to have a bitfield of exactly 2^bitWidth ready and available to set and unset bits on. The convenience of big.Int bypasses this entirely but that's not going to be the same story across languages. I have https://github.com/rvagg/iamap/blob/master/bit-utils.js for this in JS, but to go to and from this serialization format I'd have to be trimming left-most bytes that contain zeros on the way in and padding them back on the way out.
  • The randomness of the hash algorithm means that the chances of setting the 32nd bit are the same as setting the 1st. There's no bias toward smaller bits built-in here. So we buy some compaction in the serialization by using this truncated format, but it's only going to get us so far in a HAMT that's got more than a few values in it.
  • In Filecoin the bitWidth is 5, so 32 bits, or a maximum of 4 bytes in the serialization format of this field. There will be some cases with only 3, less with 2 and even less with just 1. We're saving bytes, but not many, and at the cost of complexity for everyone but Go implementers.
  • It's hard to validate canonical block form. The current implementation will (probably) take an arbitrarily long byte array and turn it into a valid big.Int. It'll treat as valid a block that has a byte array 1000 long in the position for the bitfield (I believe big.Int can handle this kind of arbitrary size). But then it should round-trip it back out as just-long-enough if the block was re-serialized. (So this is in a similar category to the problems suggested in Require canonical (round-trip identity) CBOR encoding for externally-provided data structures specs#1045).

I don't have a strong opinion here yet, would like to hear others' thoughts. My personal preference would be for it to be stable and consistent, with the bitfield byte array in CBOR being exaclty 2^bitWidth (exactly 4 bytes, every time, for Filecoin) so serialization, validation and explanation of this spec is simpler than it currently is. I doubt that the number of bytes being saved here is very meaningful—but it's not zero.

@warpfork @Stebalien @anorth thoughts?

@warpfork
Copy link
Contributor

warpfork commented Aug 7, 2020

Thoughts in no particular order:

  • Agree that this is definitely a canonicalization concern and is important to specify.
  • Therefore, agree that there should be a mode flag (on by default, or just hardcoded as true) for strict validation of the bitfield byte size in serial data, one way or another.
  • Expending an expected value of (4-1)/2 = 1.5 bytes as "waste" per bitfield (given max four bytes, and random distribution, as you outlined) seems near to a rounding error... and that's the cost when one bit is set; the expected value for 'waste' decreases as more bits are set. So if even a handful of bits are set, the expected value of waste is less than a single byte. Seems pretty cheap for the consistency and simplicity gain of a fixed size.
    • And there's one bitfield stored per block, just to sanity check I've got that right? Yeah, this seems very low cost indeed.
  • Agree that seeing bigint just generally sets my spider senses a'tingle. There's nothing that's hard-edged wrong about it, but it would make me more comfortable if we didn't have it.
    • Anecdata: I have (distant) unhappy memories of learning the details of Java's BigInt the hard way; something about the sign being stored in not at all the way I would've initially expected, and this reflecting how many bytes were in the serial form. I don't have specific opinions about Golang's big.Int, and I haven't made a specific comparison of it to other bigints implementations such as Java's BigInt... but I bet they're different in "interesting" ways; and even if it turns out they aren't, I think I'd rather not have to think about it (and more importantly, not ask other implementors in other languages to do it either, for whatever their BigInt options might be).

So, I haven't estimated how much work it would be to change this, but I think I broadly agree with your preference for the bitfield byte array to be exactly 2^bitWidth.

@ianopolous
Copy link
Contributor

@warpfork For comparison, we don't use Java's BigInteger in our implementation, but BitSet which is much simpler and I think inline with go's big.Int?
https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html

@Stebalien
Copy link
Member

I agree with just storing the entire bitmap.

@anorth
Copy link
Member

anorth commented Aug 8, 2020

Storing the bitfield explicitly sgtm.

@rvagg
Copy link
Member Author

rvagg commented Aug 13, 2020

@anorth anorth changed the title TODO: clarify bitfield format; size, ordering, etc. Simplify bitfield format; size, ordering, etc. Aug 25, 2020
@anorth anorth removed the need/triage Needs initial labeling and prioritization label Aug 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants