fix(manifest): stop writing deprecated distinct_counts (field-id 111) by dhananjaykrutika · Pull Request #1102 · apache/iceberg-go

dhananjaykrutika · 2026-05-20T10:55:57Z

The Avro wire schema declared distinct_counts on data_file v1 and v2, causing it to be emitted on every manifest entry. The Iceberg spec marks this field as "Deprecated. Do not write." (https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L667), so writers should not emit it. The field has been deprecated since 2020; see apache/iceberg#767 (comment).

Strict readers that don't include field-id 111 in their data_file read schema (e.g. PyIceberg 0.10) fail to resolve manifests written by this library with: ResolveError "File/read schema are not aligned for map, got None".
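As a rough illustration of the writer-side condition described above, the sketch below checks whether an Avro record schema (as JSON) declares a given Iceberg `field-id`. It is not code from this PR: `declaresFieldID`, `legacySchema`, and `fixedSchema` are hypothetical names, and both schemas are heavily abbreviated stand-ins for the real data_file record (the type shown for distinct_counts is a placeholder).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative, heavily abbreviated schemas (assumption: not the real
// data_file schema; the distinct_counts type is only a placeholder).
const legacySchema = `{
  "type": "record", "name": "data_file", "fields": [
    {"name": "file_path", "type": "string", "field-id": 100},
    {"name": "record_count", "type": "long", "field-id": 103},
    {"name": "distinct_counts", "type": ["null", "bytes"], "field-id": 111}
  ]}`

const fixedSchema = `{
  "type": "record", "name": "data_file", "fields": [
    {"name": "file_path", "type": "string", "field-id": 100},
    {"name": "record_count", "type": "long", "field-id": 103}
  ]}`

// recordSchema models only the parts of an Avro record schema needed here.
type recordSchema struct {
	Fields []struct {
		Name    string `json:"name"`
		FieldID int    `json:"field-id"`
	} `json:"fields"`
}

// declaresFieldID reports whether the top-level record declares a field
// carrying the given Iceberg field-id. Hypothetical helper for this sketch.
func declaresFieldID(schemaJSON string, id int) (bool, error) {
	var rec recordSchema
	if err := json.Unmarshal([]byte(schemaJSON), &rec); err != nil {
		return false, fmt.Errorf("parse schema: %w", err)
	}
	for _, f := range rec.Fields {
		if f.FieldID == id {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	schemas := []struct{ name, json string }{
		{"legacy", legacySchema},
		{"fixed", fixedSchema},
	}
	for _, s := range schemas {
		found, err := declaresFieldID(s.json, 111)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s schema declares field-id 111: %v\n", s.name, found)
	}
}
```

A check of this shape could serve as a cheap guard that a writer schema never reintroduces the deprecated field, independent of any particular Avro library.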

dhananjaykrutika · 2026-05-21T08:50:17Z

The integration test failure comes from PR #355. The change in #1105 fixes it, so this test should pass once #1105 is merged.

twuebi · 2026-05-21T12:07:30Z

cc @laskoviymishka

laskoviymishka

Nice, narrow fix: dropping field 111 from the v1/v2 Avro schemas is the right enforcement layer, and the PyIceberg 0.10 unblock is a very real reason to do it.

I’d hold merge for a couple of cleanup points though. Main one: removing TestWriteManifestV{1,2}KeepsDistinctCounts entirely feels a bit overkill :D

We no longer want to write distinct_counts, but we should still be able to read legacy manifests that already have field 111 on the wire. The dataFile struct still has the Avro tag, so older iceberg-go / Spark-written manifests should decode fine — but after deleting those tests, nothing pins that behavior.

I’d add a small back-compat test that bypasses WriteManifest, builds/uses a raw OCF fixture with the old schema and field 111 present, then reads it through ReadManifest and asserts the map comes back.
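The shape of such a back-compat test can be sketched as follows. This is only an illustration under loud assumptions: it uses `encoding/json` as a stand-in for Avro OCF so that it stays self-contained, and `dataFile`, `readEntry`, and `legacyFixture` are hypothetical names rather than the library's `ReadManifest` path. The point it shows is that the fixture is raw bytes that never pass through the writer, while the struct keeps its decode tag for the deprecated field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyFixture stands in for bytes produced by an older writer that still
// emitted distinct_counts. It is hand-written, so it bypasses any writer.
const legacyFixture = `{"file_path":"s3://bucket/a.parquet","distinct_counts":{"1":42}}`

// modernFixture stands in for bytes from a writer that omits the field.
const modernFixture = `{"file_path":"s3://bucket/b.parquet"}`

// dataFile is a hypothetical, pared-down struct that keeps the tag for the
// deprecated field so legacy input still decodes.
type dataFile struct {
	Path           string        `json:"file_path"`
	DistinctCounts map[int]int64 `json:"distinct_counts,omitempty"`
}

// readEntry decodes one entry from raw bytes (JSON here, Avro in reality).
func readEntry(raw []byte) (dataFile, error) {
	var df dataFile
	if err := json.Unmarshal(raw, &df); err != nil {
		return dataFile{}, fmt.Errorf("decode entry: %w", err)
	}
	return df, nil
}

func main() {
	for _, raw := range []string{legacyFixture, modernFixture} {
		df, err := readEntry([]byte(raw))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %d distinct-count entries\n", df.Path, len(df.DistinctCounts))
	}
}
```

In the real test the fixture would be an Avro object container file built with the old schema; the assertion (the map comes back populated) is the part this sketch is meant to convey.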

A few smaller things before merge:

- update MarshalAvroEntry / EncodeDataFile godoc: distinct_counts is dropped on encode for all versions now, not just v3
- rename/broaden TestWriteManifestV3OmitsDistinctCounts to cover v1/v2/v3
- either remove the v3-only prepareEntry special case, or mirror the nil-clear for v1/v2 too
- add `// Deprecated:` godoc on DataFileBuilder.DistinctValueCounts, since it’s now a public setter that won’t encode
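Two of the items above (the `// Deprecated:` godoc and a version-independent nil-clear) can be sketched as below. The types are hypothetical stand-ins, not the library's actual `DataFileBuilder` or `prepareEntry`; only the godoc convention and the unconditional clear are the point.

```go
package main

import "fmt"

// dataFile is a hypothetical stand-in for the library's internal struct.
type dataFile struct {
	Path           string
	DistinctCounts map[int]int64
}

// DataFileBuilder is a hypothetical, pared-down builder.
type DataFileBuilder struct{ df dataFile }

// DistinctValueCounts sets the per-column distinct value counts.
//
// Deprecated: distinct_counts (field-id 111) is marked "Deprecated. Do not
// write." in the Iceberg spec, so values set here are not encoded into
// manifests. The setter remains only for source compatibility.
func (b *DataFileBuilder) DistinctValueCounts(counts map[int]int64) *DataFileBuilder {
	b.df.DistinctCounts = counts
	return b
}

// Build returns the assembled data file.
func (b *DataFileBuilder) Build() dataFile { return b.df }

// prepareEntry returns a copy of df that is ready to encode. The clear is
// unconditional: there is no v3-only branch, so v1 and v2 behave the same.
// The caller's value is untouched because df is passed by value.
func prepareEntry(version int, df dataFile) dataFile {
	df.DistinctCounts = nil
	return df
}

func main() {
	src := (&DataFileBuilder{}).DistinctValueCounts(map[int]int64{1: 42}).Build()
	for version := 1; version <= 3; version++ {
		out := prepareEntry(version, src)
		fmt.Printf("v%d: %d distinct-count entries to encode\n", version, len(out.DistinctCounts))
	}
}
```

Tools such as gopls and staticcheck recognize a paragraph beginning with `Deprecated:` in a doc comment, which is why the exact spelling matters.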

Fix itself looks right — I’d just keep the legacy read coverage instead of deleting the whole old test shape.

The Avro wire schema declared distinct_counts on data_file v1 and v2, causing it to be emitted on every manifest entry. The Iceberg spec marks this field as "Deprecated. Do not write." (apache/iceberg format/spec.md), so writers should not emit it. Strict readers that don't include field-id 111 in their data_file read schema (e.g. PyIceberg 0.10) fail to resolve manifests written by this library with: ResolveError "File/read schema are not aligned for map, got None".

PR apache#1075's TestEncodeDecodeDataFileRoundTrip fixture populated distinct_counts on v1/v2 and asserted the field round-tripped, on the premise that v1/v2 manifest-entry schemas declared field 111. Update the test to match: populate DistinctValueCounts on every version's fixture and assert it is empty after round-trip on every version. The assertion now serves as a regression guard for the intended behavior -- manifests written by this library never carry field 111, regardless of what the source DataFile holds in memory.

zeroshade · 2026-05-21T16:25:37Z

@dhananjaykrutika I hope you don't mind, but I'm going to address @laskoviymishka's comments so we can get this across the finish line and cut a new RC, since this issue makes us incompatible with PyIceberg.

laskoviymishka

🚢 🫰

… v18.6.0 (#1114)

Restores main CI to green by reverting only the test-expectation byte counts that #1102 ("fix(manifest): stop writing deprecated distinct_counts") inadvertently changed.

## What broke

After #1102 merged to main, both the `Go` and `Audit and Verify` workflows started failing on every push, across all 6 matrix entries (ubuntu/windows/macos × Go 1.25.5/1.26.1):

```
--- FAIL: TestTableWriting (10.16s)
    --- FAIL: TestTableWriting/TestAddFilesPartitionedTable
    --- FAIL: TestTableWriting/TestAddFilesUnpartitioned
    --- FAIL: TestTableWriting/TestReplaceDataFiles
    --- FAIL: TestTableWriting/TestAddFilesPartitionedTable#01
    --- FAIL: TestTableWriting/TestAddFilesUnpartitioned#01
    --- FAIL: TestTableWriting/TestReplaceDataFiles#01
--- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch (0.00s)
    --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/success
    --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/batch_with_records_having_different_file_paths
```

with diffs like:

```
- "added-files-size": "3590"   (actual on CI / arrow-go v18.6.0)
+ "added-files-size": "3070"   (test source, post-#1102)
- ColumnSizes: 2147483545:88   (actual on CI / arrow-go v18.6.0)
+ ColumnSizes: 2147483545:86   (test source, post-#1102)
```

## Root cause

These two test files assert on **exact byte counts** of the on-disk parquet files written by `arrow-go`. Those byte counts depend on the parquet writer's metadata encoding, which differs across arrow-go versions.

#1102 lowered the expected counts (3590→3070, 1066→963, 2132→1816, 4264→3687, 88→86, 174→172, 96→94, 187→185). Those new values only reproduce against an unreleased arrow-go build (presumably wired in via a local `go.work`). Against the pinned `github.com/apache/arrow-go/v18 v18.6.0` from `go.sum`, which every CI runner resolves, the writer still emits the original larger sizes, so the assertions fail every time.

The PR's own CI runs were `CANCELLED` before completion, so this didn't surface at merge time. The functional change in #1102 (dropping `distinct_counts` field-id 111 from the manifest entry Avro schema) only alters **manifest** serialization, not the **data** parquet files whose sizes these `Summary`/`ColumnSizes` fields tally, so the test-value updates were always unrelated to the production fix and can be safely reverted on their own.

## Fix

Revert only the eight byte-count lines back to the pre-#1102 values verified via `git show 51b3140^`:

| File | Test | Field | Before | #1102 | This PR |
|---|---|---|---|---|---|
| `table/table_test.go:549,554` | `TestAddFilesUnpartitioned` | `added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:770,776` | `TestAddFilesPartitionedTable` | `added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:1136,1140,1144` | `TestReplaceDataFiles` | `added-/removed-/total-files-size` | 1066/2132/4264 | 963/1816/3687 | **1066/2132/4264** |
| `table/pos_delete_partitioned_fanout_writer_test.go:77` | `success` | `ColumnSizes` | 88,174 | 86,172 | **88,174** |
| `table/pos_delete_partitioned_fanout_writer_test.go:87` | `batch_with_records_having_different_file_paths` | `ColumnSizes` | 96,187 | 94,185 | **96,187** |

No production code changes: #1102's manifest fix is preserved intact.

## Verification

Reproduced CI failure locally with `GOWORK=off go test ./...` (forces use of pinned v18.6.0):

- **Before this PR**: same 8 sub-test failures with identical expected/actual diffs as CI.
- **After this PR**: `ok` across every package: `iceberg-go`, `catalog/{glue,hadoop,hive,rest,sql}`, `cmd/iceberg`, `codec`, `config`, `internal`, `io`, `io/gocloud`, `puffin`, `table`, `table/{compaction,dv,internal,substrait}`, `view`, `view/internal`.
- `go vet ./...` clean, `go build ./...` clean, LSP diagnostics clean on both touched files.

## Followup (out of scope)

These tests are inherently brittle: they'll keep breaking on every arrow-go bump that nudges parquet metadata encoding. A future cleanup could replace exact byte assertions with bounds (`> 0`, monotonic relationships between added/removed/total) or assert on row counts only. Not addressing here to keep the diff minimal and unblock main.
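One possible shape for the bounds-based follow-up is sketched below. It is not part of either PR: `checkSizeSummary` and `sizeOf` are hypothetical helpers, and the sketch assumes the snapshot summary is available as a `map[string]string` and that `total-files-size` is never smaller than `added-files-size`.

```go
package main

import (
	"fmt"
	"strconv"
)

// sizeOf parses one byte-count entry from a snapshot summary.
func sizeOf(summary map[string]string, key string) (int64, error) {
	raw, ok := summary[key]
	if !ok {
		return 0, fmt.Errorf("summary is missing %q", key)
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("summary %q is not an integer: %w", key, err)
	}
	return n, nil
}

// checkSizeSummary asserts relationships instead of exact byte counts, so it
// does not depend on the parquet writer's metadata encoding.
// Assumption: total-files-size includes the added files.
func checkSizeSummary(summary map[string]string) error {
	added, err := sizeOf(summary, "added-files-size")
	if err != nil {
		return err
	}
	total, err := sizeOf(summary, "total-files-size")
	if err != nil {
		return err
	}
	if added <= 0 {
		return fmt.Errorf("added-files-size = %d, want > 0", added)
	}
	if total < added {
		return fmt.Errorf("total-files-size = %d is below added-files-size = %d", total, added)
	}
	return nil
}

func main() {
	// Both sets of byte counts from the diff above satisfy the same check.
	for _, size := range []string{"3590", "3070"} {
		summary := map[string]string{"added-files-size": size, "total-files-size": size}
		fmt.Printf("size %s: err = %v\n", size, checkSizeSummary(summary))
	}
}
```

The trade-off is that such a check no longer notices a genuine size regression, which is why asserting on row counts alongside it may still be worthwhile.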

dhananjaykrutika requested a review from zeroshade as a code owner May 20, 2026 10:55

laskoviymishka requested changes May 21, 2026

Krutika Dhananjay added 4 commits May 21, 2026 12:13

test(manifest): drop v1/v2 distinct_counts tests

d6c7855

fix lint error

00585f4

zeroshade force-pushed the drop-deprecated-distinct-counts branch from 6deb071 to 853eb32 Compare May 21, 2026 16:13

address the review comments

7d3aaf2

laskoviymishka self-requested a review May 21, 2026 16:50

laskoviymishka approved these changes May 21, 2026

laskoviymishka merged commit 51b3140 into apache:main May 21, 2026
5 of 14 checks passed

happydave1 mentioned this pull request May 21, 2026

feat(table): Adding geometry and geography type + schema plumbing #984

Open

zeroshade mentioned this pull request May 21, 2026

fix(table): restore Summary/ColumnSizes test expectations to arrow-go v18.6.0 #1114

Merged

laskoviymishka merged 5 commits into apache:main from dhananjaykrutika:drop-deprecated-distinct-counts
