Simplify bitfield format; size, ordering, etc. #54

rvagg · 2020-07-22T02:44:47Z

In ipld/specs#282 I'm claiming that the map (bitfield) is addressed here in the same way as the specification, and therefore the reference implementation for the IPLD HashMap. The spec currently doesn't use super clear language about how you address a specific bit of a byte array but it's assuming standard LE addressing where "reading index bit of the byte array" implies a certain ordering that you can apply masks to. I don't know how the Go bitfield works but it would be good to verify if the bit checking used here matches operations on the same byte array as used here https://github.com/rvagg/iamap/blob/master/bit-utils.js and if not, document exactly how it works so implementers can be crystal clear.

The text was updated successfully, but these errors were encountered:

rvagg · 2020-07-22T02:50:54Z

/cc @warpfork

rvagg · 2020-08-07T04:27:30Z

Changed the title of this after playing a bit with the format. Dropping some thoughts here as I explore this.

The bitfield at the moment uses a big.Int which can spit out a type of big-endian format. (some of the guts are in https://golang.org/src/math/big/nat.go with the API in https://golang.org/src/math/big/int.go). We're using the Bytes() and SetBytes() methods for serialization and deserialization. What this seems to give us is the most compact big-endian representation of the number it holds. We're using SetBit() to set individual bits on and off to indicate the presence or absence of an element in our data array, so the number is arbitrary, it's the bits that matter.

The maximum length of the bitfield should be 2^bitWidth to hold enough bits to cover all the indexing we need for any node. So a bitWidth of 8 gives us 256 bits needed to store our bitmap data. So if we were to turn on all the bits because we have a full data array, we'd end up serializing 0xff... for 32-bytes, i.e. 256 1's. But if we only tinker with the first 8 bits, then we only need to serialize one byte. e.g. if we only had an element at index 0 then our bitfield would be a single byte, 0x01, but if we only set the 8th bit then we need to bytes so would serialize 0x0100, and so on.

Filecoin only uses a bitWidth of 5, so that's 32-bits, or 4-bytes needed to represent the bitfield.

Some thoughts about this format:

It's somewhat Go specific. It's convenient to get these from a big.Int and set them to a big.Int but it's going to be slightly annoying for everyone else unless they have something already that works in exactly the same way. The ideal internal representation is for a node to have a bitfield of exactly 2^bitWidth ready and available to set and unset bits on. The convenience of big.Int bypasses this entirely but that's not going to be the same story across languages. I have https://github.com/rvagg/iamap/blob/master/bit-utils.js for this in JS, but to go to and from this serialization format I'd have to be trimming left-most bytes that contain zeros on the way in and padding them back on the way out.
The randomness of the hash algorithm means that the chances of setting the 32nd bit are the same as setting the 1st. There's no bias toward smaller bits built-in here. So we buy some compaction in the serialization by using this truncated format, but it's only going to get us so far in a HAMT that's got more than a few values in it.
In Filecoin the bitWidth is 5, so 32 bits, or a maximum of 4 bytes in the serialization format of this field. There will be some cases with only 3, less with 2 and even less with just 1. We're saving bytes, but not many, and at the cost of complexity for everyone but Go implementers.
It's hard to validate canonical block form. The current implementation will (probably) take an arbitrarily long byte array and turn it into a valid big.Int. It'll treat as valid a block that has a byte array 1000 long in the position for the bitfield (I believe big.Int can handle this kind of arbitrary size). But then it should round-trip it back out as just-long-enough if the block was re-serialized. (So this is in a similar category to the problems suggested in Require canonical (round-trip identity) CBOR encoding for externally-provided data structures specs#1045).

I don't have a strong opinion here yet, would like to hear others' thoughts. My personal preference would be for it to be stable and consistent, with the bitfield byte array in CBOR being exaclty 2^bitWidth (exactly 4 bytes, every time, for Filecoin) so serialization, validation and explanation of this spec is simpler than it currently is. I doubt that the number of bytes being saved here is very meaningful—but it's not zero.

@warpfork @Stebalien @anorth thoughts?

warpfork · 2020-08-07T11:00:05Z

Thoughts in no particular order:

Agree that this is definitely a canonicalization concern and is important to specify.
Therefore, agree that there should be a mode flag (on by default, or just hardcoded as true) for strict validation of the bitfield byte size in serial data, one way or another.
Expending an expected value of (4-1)/2 = 1.5 bytes as "waste" per bitfield (given max four bytes, and random distribution, as you outlined) seems near to a rounding error... and that's the cost when one bit is set; the expected value for 'waste' decreases as more bits are set. So if even a handful of bits are set, the expected value of waste is less than a single byte. Seems pretty cheap for the consistency and simplicity gain of a fixed size.
- And there's one bitfield stored per block, just to sanity check I've got that right? Yeah, this seems very low cost indeed.
Agree that seeing bigint just generally sets my spider senses a'tingle. There's nothing that's hard-edged wrong about it, but it would make me more comfortable if we didn't have it.
- Anecdata: I have (distant) unhappy memories of learning the details of Java's BigInt the hard way; something about the sign being stored in not at all the way I would've initially expected, and this reflecting how many bytes were in the serial form. I don't have specific opinions about Golang's big.Int, and I haven't made a specific comparison of it to other bigints implementations such as Java's BigInt... but I bet they're different in "interesting" ways; and even if it turns out they aren't, I think I'd rather not have to think about it (and more importantly, not ask other implementors in other languages to do it either, for whatever their BigInt options might be).

So, I haven't estimated how much work it would be to change this, but I think I broadly agree with your preference for the bitfield byte array to be exactly 2^bitWidth.

ianopolous · 2020-08-07T11:09:37Z

@warpfork For comparison, we don't use Java's BigInteger in our implementation, but BitSet which is much simpler and I think inline with go's big.Int?
https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html

Stebalien · 2020-08-08T02:26:56Z

I agree with just storing the entire bitmap.

anorth · 2020-08-08T04:46:52Z

Storing the bitfield explicitly sgtm.

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg · 2020-08-13T13:05:52Z

done in https://github.com/ipfs/go-hamt-ipld/pull/63

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg added the need/triage Needs initial labeling and prioritization label Jul 22, 2020

rvagg mentioned this issue Jul 22, 2020

Documentation #52

Merged

rvagg changed the title ~~TODO: verify byte ordering of the bitfield matches spec~~ TODO: clarify bitfield format; size, ordering, etc. Aug 7, 2020

rvagg mentioned this issue Aug 8, 2020

Load-time block format validation #59

Merged

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 13, 2020

introduce Bitmap for handling bitfield operations

b9ea0bc

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 13, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

091854c

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg mentioned this issue Aug 13, 2020

Introduce Bitmap type, use fixed size byte array in serialization, tighter bitWidth validation #63

Closed

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

9332b98

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

fe0cb19

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

10ffc2b

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

57559b1

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

10aeed4

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

a9a0caf

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

rvagg referenced this issue in rvagg/go-hamt-ipld Aug 14, 2020

BREAKING CHANGE: introduce Bitmap for handling bitfield operations

7eb4d1d

Closes: https://github.com/ipfs/go-hamt-ipld/issues/54

anorth changed the title ~~TODO: clarify bitfield format; size, ordering, etc.~~ Simplify bitfield format; size, ordering, etc. Aug 25, 2020

anorth removed the need/triage Needs initial labeling and prioritization label Aug 25, 2020

warpfork mentioned this issue Aug 26, 2020

Additional benchmarks #62

Merged

ZenGround0 mentioned this issue Nov 27, 2020

Filecoin HAMT v3 Improvements filecoin-project/FIPs#38

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify bitfield format; size, ordering, etc. #54

Simplify bitfield format; size, ordering, etc. #54

rvagg commented Jul 22, 2020

rvagg commented Jul 22, 2020

rvagg commented Aug 7, 2020

warpfork commented Aug 7, 2020

ianopolous commented Aug 7, 2020

Stebalien commented Aug 8, 2020

anorth commented Aug 8, 2020

rvagg commented Aug 13, 2020

Simplify bitfield format; size, ordering, etc. #54

Simplify bitfield format; size, ordering, etc. #54

Comments

rvagg commented Jul 22, 2020

rvagg commented Jul 22, 2020

rvagg commented Aug 7, 2020

warpfork commented Aug 7, 2020

ianopolous commented Aug 7, 2020

Stebalien commented Aug 8, 2020

anorth commented Aug 8, 2020

rvagg commented Aug 13, 2020