fix(manifest): emit Avro fixed schema for FixedType partition columns#876
Closed
alliasgher wants to merge 5 commits into
Closed
fix(manifest): emit Avro fixed schema for FixedType partition columns#876alliasgher wants to merge 5 commits into
alliasgher wants to merge 5 commits into
Conversation
Partition columns of Iceberg FixedType were being written into manifest partition schemas as Avro "bytes" with a stale TODO, because hamba/avro cannot dispatch [N]byte arrays inside a union (hamba/avro#571). That silently produced spec-non-compliant manifests: any compliant reader expects "fixed[N]" for those columns and sees a type mismatch. Work around the upstream limitation and emit the correct schema: * partitionTypeToAvroSchema now builds a per-length named fixed schema (fixed_N) wrapped in the usual nullable union. * Track FixedType partition fields in a new partFieldIDToFixedPartSize map threaded from getFieldIDMap through both ManifestWriter and ManifestReader. Decimal-scale tracking (stored in the misleadingly named fieldIDToFixedSize) is unchanged. * At encode time, convertFixedValue pads/truncates the partition []byte to the declared size and hands hamba/avro a {"fixed_N": [N]byte} union branch, which it does accept. * At decode time, unwrapFixedValue reverses the wrap back into []byte so downstream Iceberg code keeps its existing slice-based shape. Adds a round-trip regression test for a non-nil fixed partition value that asserts both the on-wire schema (fixed branch with the right size) and that the value survives encode → decode as the original []byte. Fixes apache#845 Signed-off-by: Ali <ali@kscope.ai>
Signed-off-by: Ali <alliasgher123@gmail.com>
laskoviymishka
requested changes
Apr 13, 2026
Contributor
laskoviymishka
left a comment
There was a problem hiding this comment.
The spec violation is real and this is the right direction, bytes for FixedType partition columns is wrong, fixed_N naming matches the Java reference impl. But there are gaps that need addressing before merge.
Main concerns:
- convertFixedValue reuses padOrTruncateBytes — silent truncation of partition values is a data correctness risk, not a "just in case" convenience
- Test only covers raw avro marshal/unmarshal — doesn't exercise the actual manifest read/write path at all
- getFieldIDMap now returns 4 bare maps of similar types — fragile, should be a struct
Details in line comments.
…lect usage Two review fixes for the FixedType partition encoding: convertFixedValue: - Drop the FixedLiteral branch — partition values in map[int]any are always plain []byte by the time they reach this function (they come from avro decode or DataFile.Partition(), never as the named type). Keeping speculative dead code in a serialization path is where bugs hide. - Replace padOrTruncateBytes with a strict length check. Silent truncation of a partition value would corrupt the partition assignment and cause queries to return wrong results; panicking loudly is the correct behaviour for an upstream invariant violation. unwrapFixedValue: - Add a comment explaining why reflect is needed: hamba/avro decodes a fixed union branch as a Go [N]byte array (not a []byte slice), and reflect is the only portable way to copy out of an array whose element count is not known at compile time. Signed-off-by: Ali <alliasgher123@gmail.com>
The function previously returned four bare values, two of which were map[int]int with entirely different semantics (decimal scale vs FixedType byte length). A caller that swapped the last two arguments would compile silently and produce wrong results. Introduce partitionFieldIDMap with named fields: nameToID map[string]int logicalTypes map[int]avro.LogicalType decimalScales map[int]int // decimal scale per field id fixedByteSizes map[int]int // byte length per FixedType field id Update ManifestReader and ManifestWriter to unpack via the struct. No logic change. Signed-off-by: Ali <alliasgher123@gmail.com>
…s for FixedType partitions The existing TestFixedPartitionColumnAvroSchema only tested raw avro marshal/unmarshal. The actual manifest code paths were untested: getFieldIDMap → fixedByteSizes → avroPartitionData → convertFixedValue → ManifestWriter.addEntry (write side) convertAvroValueToIcebergType → unwrapFixedValue (read side) Add two tests to cover them: TestFixedTypePartitionManifestRoundTrip Creates a table schema with a FixedType partition column, writes an entry via WriteManifest, reads it back via ReadManifest, and asserts the partition value survives byte-for-byte. TestFixedTypePartitionBackwardCompatibility Simulates a manifest written by old code (bytes-typed partition column instead of fixed_N). Constructs the Avro payload directly with a bytes-typed partition field via ocf.NewEncoderWithSchema, then reads it with the new code and asserts the value is returned as []byte without panicking. This confirms backward compatibility: the old encoding path bypasses fixedByteSizes and returns the []byte as-is. Signed-off-by: Ali <alliasgher123@gmail.com>
Contributor
Author
|
Closing in favour of #881 by @tanmayrauth — apologies for the overlap. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Partition columns of Iceberg
FixedTypewere being written into manifest partition schemas as Avrobyteswith a stale TODO inschema_conversions.go, because hamba/avro cannot dispatch[N]bytearrays inside a union (hamba/avro#571). That silently produced spec-non-compliant manifests — any compliant reader expectsfixed[N]for those columns and sees a type mismatch on round-trip.This PR works around the upstream limitation and emits the correct schema:
partitionTypeToAvroSchemanow builds a per-length named fixed schema (fixed_N) wrapped in the usual nullable union.partFieldIDToFixedPartSizemap threaded fromgetFieldIDMapthrough bothManifestWriterandManifestReader. The existing decimal-scale tracking (stored in the misleadingly namedfieldIDToFixedSize) is unchanged.convertFixedValuepads/truncates the partition[]byteto the declared size and hands hamba/avro a{"fixed_N": [N]byte}union branch, which it does accept.unwrapFixedValuereverses the wrap back into[]byteso downstream Iceberg code keeps its existing slice-based shape.Fixes #845
Test plan
go test ./...go vet ./...TestFixedPartitionColumnAvroSchemaasserts both the on-wire schema (fixed branch with the right size) and that a non-nil fixed partition value survives encode → decode as the original[]byte.Signed-off-by: Ali ali@kscope.ai