test(table): add cross-client DV format compliance tests#1131
test(table): add cross-client DV format compliance tests#1131tanmayrauth wants to merge 1 commit into
Conversation
Validates that Go's DVWriter output conforms to the spec envelope format (length + magic + bitmap + CRC32) and produces Puffin files with correct blob metadata (type, snapshot-id=-1, sequence-number=-1, cardinality, referenced-data-file). Asserts byte equivalence between Go's SerializeDV output and a Java-produced DV fixture for the same input positions, and that multi-blob Puffin files have correct independent offsets. Part 3 of apache#997.
There was a problem hiding this comment.
The structural assertions are fine, envelope layout, blob metadata, multi-blob offsets all check what they should.
But for a PR titled "cross-client DV format compliance tests," I'd expect the centerpiece to be golden-file comparison against Java-produced Puffin artifacts, and that's missing 😅 .
Only one of the three tests touches a non-Go artifact, and that artifact is a raw DV envelope (the inner bytes) rather than a Puffin file.
The Puffin layer — footer offset encoding, magic, compression flag, blob property casing — is where cross-client format bugs actually show up, and nothing here exercises it against a real foreign-produced file.
| // - snapshot-id = -1, sequence-number = -1 | ||
| // - Blob properties include "cardinality" and "referenced-data-file" | ||
| // - Inner DV envelope: [length][magic 0xD1D33964][bitmap][CRC32] | ||
| func TestCrossClientDVWriterProducesSpecCompliantBlob(t *testing.T) { |
There was a problem hiding this comment.
This is Go-writes-Go-reads — NewDVWriter produces the file, puffin.NewReader + DeserializeDV reads it back, and the cardinality is checked against df.Count() which is itself Go-derived. There's no Java/PyIceberg/Rust artifact in the loop, so a Puffin-layer regression in any of those clients wouldn't be caught. TestDVWriterSingleDataFile + verifyDVReadBack in dv_writer_test.go already covers the roundtrip.
What this test could be doing — and what would actually earn the "cross-client" name — is reading a Java-produced single-blob Puffin file from testdata/ via ReadDV, then asserting on cardinality, referenced-data-file, and bitmap contents. The testdata directory has Java-produced DV envelopes (.bin) but no Java-produced Puffin file; checking one in here is the unlock.
| // input positions. Byte equivalence is a stronger cross-client guarantee than | ||
| // roundtrip-readability: any spec-compliant reader will accept Go's output, | ||
| // and Go's output is indistinguishable from Java's at the wire level. | ||
| func TestCrossClientGoSerializeMatchesJavaFormat(t *testing.T) { |
There was a problem hiding this comment.
This is the only test that compares against an external artifact, so it's pulling a lot of weight — but the artifact is the inner DV envelope, not a Puffin file, and the fixture [1,3,5,7,9] is exactly the case that's easiest to byte-match: scattered low-cardinality positions, single roaring array container, no run encoding triggered on either side.
That makes the comment overstate the guarantee in two ways. First, it's silent on the Puffin layer (where most cross-client bugs live). Second, byte-equality doesn't generalize — Java's BitmapPositionDeleteIndex.serialize() calls runLengthEncode() before writing and Go's RoaringPositionBitmap.Serialize() doesn't, so for range-style deletes the two outputs diverge bytewise while both remaining spec-compliant.
I'd scope this test to what it actually proves (single-container DV-envelope parity for this fixture) and add coverage at two new axes:
- A Java-produced Puffin fixture (not just an inner envelope) read end-to-end through
ReadDV. - At least one DV-envelope fixture that crosses the 4096-entry array→bitmap threshold, and ideally one >2^32 to force a second roaring key — the container/key boundaries are where encoding differences actually surface.
| // TestCrossClientMultiFileDVPuffin validates that a multi-blob Puffin file | ||
| // (one blob per data file) has correct offsets and each blob is independently | ||
| // readable — matching how Java writes multi-DV Puffin files. | ||
| func TestCrossClientMultiFileDVPuffin(t *testing.T) { |
There was a problem hiding this comment.
Same shape as the comment on TestCrossClientDVWriterProducesSpecCompliantBlob — Go writes the multi-blob Puffin, Go reads it back, asserts cardinality against df.Count(). TestDVWriterMultipleDataFiles already covers this. To actually be a cross-client test, this should consume a Java-produced multi-blob Puffin from testdata/ and check that ReadDV recovers the right bitmap for each DataFile. The Puffin footer layout for multi-blob files is exactly the place where small encoding choices (offset encoding, blob-meta ordering) cause silent corruption across clients.
Validates that Go's DVWriter output conforms to the spec envelope format (length + magic + bitmap + CRC32) and produces Puffin files with correct blob metadata (type, snapshot-id=-1, sequence-number=-1, cardinality, referenced-data-file). Asserts byte equivalence between Go's SerializeDV output and a Java-produced DV fixture for the same input positions, and that multi-blob Puffin files have correct independent offsets.
Part 3 of #997.