fix(manifest): stop writing deprecated distinct_counts (field-id 111)#1102
Conversation
laskoviymishka
left a comment
There was a problem hiding this comment.
Nice, narrow fix, dropping field 111 from the v1/v2 Avro schemas is the right enforcement layer, and the PyIceberg 0.10 unblock is a very real reason to do it.
I’d hold merge for a couple of cleanup points though. Main one: removing TestWriteManifestV{1,2}KeepsDistinctCounts entirely feels a bit overkill :D
We no longer want to write distinct_counts, but we should still be able to read legacy manifests that already have field 111 on the wire. The dataFile struct still has the Avro tag, so older iceberg-go / Spark-written manifests should decode fine — but after deleting those tests, nothing pins that behavior.
I’d add a small back-compat test that bypasses WriteManifest, builds/uses a raw OCF fixture with the old schema and field 111 present, then reads it through ReadManifest and asserts the map comes back.
A few smaller things before merge:
- update
MarshalAvroEntry/EncodeDataFilegodoc:distinct_countsis dropped on encode for all versions now, not just v3 - rename/broaden
TestWriteManifestV3OmitsDistinctCountsto cover v1/v2/v3 - either remove the v3-only
prepareEntryspecial case, or mirror the nil-clear for v1/v2 too - add
// Deprecated:godoc onDataFileBuilder.DistinctValueCounts, since it’s now a public setter that won’t encode
Fix itself looks right — I’d just keep the legacy read coverage instead of deleting the whole old test shape.
The Avro wire schema declared distinct_counts on data_file v1 and v2, causing it to be emitted on every manifest entry. The Iceberg spec marks this field as "Deprecated. Do not write." (apache/iceberg format/spec.md), so writers should not emit it. Strict readers that don't include field-id 111 in their data_file read schema (e.g. PyIceberg 0.10) fail to resolve manifests written by this library with: ResolveError "File/read schema are not aligned for map, got None".
PR apache#1075's TestEncodeDecodeDataFileRoundTrip fixture populated distinct_counts on v1/v2 and asserted the field round-tripped, on the premise that v1/v2 manifest-entry schemas declared field 111. Update the test to match: populate DistinctValueCounts on every version's fixture and assert it is empty after round-trip on every version. The assertion now serves as a regression guard for the intended behavior -- manifests written by this library never carry field 111, regardless of what the source DataFile holds in memory.
6deb071 to
853eb32
Compare
|
@dhananjaykrutika I hope you don't mind that I'm gonna address @laskoviymishka's comments so we can get this across the finish line so that we can create a new RC as this issue is causing us to have an incompatibility with pyiceberg |
… v18.6.0 (#1114) Restores main CI to green by reverting only the test-expectation byte counts that #1102 ("fix(manifest): stop writing deprecated distinct_counts") inadvertently changed. ## What broke After #1102 merged to main, both the `Go` and `Audit and Verify` workflows started failing on every push, across all 6 matrix entries (ubuntu/windows/macos × Go 1.25.5/1.26.1): ``` --- FAIL: TestTableWriting (10.16s) --- FAIL: TestTableWriting/TestAddFilesPartitionedTable --- FAIL: TestTableWriting/TestAddFilesUnpartitioned --- FAIL: TestTableWriting/TestReplaceDataFiles --- FAIL: TestTableWriting/TestAddFilesPartitionedTable#01 --- FAIL: TestTableWriting/TestAddFilesUnpartitioned#01 --- FAIL: TestTableWriting/TestReplaceDataFiles#01 --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch (0.00s) --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/success --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/batch_with_records_having_different_file_paths ``` with diffs like: ``` - "added-files-size": "3590" (actual on CI / arrow-go v18.6.0) + "added-files-size": "3070" (test source, post-#1102) - ColumnSizes: 2147483545:88 (actual on CI / arrow-go v18.6.0) + ColumnSizes: 2147483545:86 (test source, post-#1102) ``` ## Root cause These two test files assert on **exact byte counts** of the on-disk parquet files written by `arrow-go`. Those byte counts depend on the parquet writer's metadata encoding, which differs across arrow-go versions. #1102 lowered the expected counts (3590→3070, 1066→963, 1816→2132+→1816, 4264→3687, 88→86, 174→172, 96→94, 187→185). Those new values only reproduce against an unreleased arrow-go build (presumably wired in via a local `go.work`). Against the pinned `github.com/apache/arrow-go/v18 v18.6.0` from `go.sum` — which every CI runner resolves — the writer still emits the original larger sizes, so the assertions fail every time. The PR's own CI runs were `CANCELLED` before completion, so this didn't surface at merge time. The functional change in #1102 (dropping `distinct_counts` field-id 111 from the manifest entry Avro schema) only alters **manifest** serialization, not the **data** parquet files whose sizes these `Summary`/`ColumnSizes` fields tally — so the test-value updates were always unrelated to the production fix and can be safely reverted on their own. ## Fix Revert only the eight byte-count lines back to the pre-#1102 values verified via `git show 51b3140^`: | File | Test | Field | Before | #1102 | This PR | |---|---|---|---|---|---| | `table/table_test.go:549,554` | `TestAddFilesUnpartitioned` | `added-/total-files-size` | 3590 | 3070 | **3590** | | `table/table_test.go:770,776` | `TestAddFilesPartitionedTable` | `added-/total-files-size` | 3590 | 3070 | **3590** | | `table/table_test.go:1136,1140,1144` | `TestReplaceDataFiles` | `added-/removed-/total-files-size` | 1066/2132/4264 | 963/1816/3687 | **1066/2132/4264** | | `table/pos_delete_partitioned_fanout_writer_test.go:77` | `success` | `ColumnSizes` | 88,174 | 86,172 | **88,174** | | `table/pos_delete_partitioned_fanout_writer_test.go:87` | `batch_with_records_having_different_file_paths` | `ColumnSizes` | 96,187 | 94,185 | **96,187** | No production code changes — #1102's manifest fix is preserved intact. ## Verification Reproduced CI failure locally with `GOWORK=off go test ./...` (forces use of pinned v18.6.0): - **Before this PR**: same 8 sub-test failures with identical expected/actual diffs as CI. - **After this PR**: `ok` across every package — `iceberg-go`, `catalog/{glue,hadoop,hive,rest,sql}`, `cmd/iceberg`, `codec`, `config`, `internal`, `io`, `io/gocloud`, `puffin`, `table`, `table/{compaction,dv,internal,substrait}`, `view`, `view/internal`. - `go vet ./...` clean, `go build ./...` clean, LSP diagnostics clean on both touched files. ## Followup (out of scope) These tests are inherently brittle — they'll keep breaking on every arrow-go bump that nudges parquet metadata encoding. A future cleanup could replace exact byte assertions with bounds (`> 0`, monotonic relationships between added/removed/total) or assert on row counts only. Not addressing here to keep the diff minimal and unblock main.
The Avro wire schema declared
distinct_countson data_file v1 and v2, causing it to be emitted on every manifest entry. The Iceberg spec marks this field as "Deprecated. Do not write." (https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L667), so writers should not emit it. The field stands deprecated since 2020 -- apache/iceberg#767 (comment)Strict readers that don't include field-id 111 in their data_file read schema (e.g. PyIceberg 0.10) fail to resolve manifests written by this library with: ResolveError "File/read schema are not aligned for map, got None".