Skip to content

fix(manifest): stop writing deprecated distinct_counts (field-id 111)#1102

Merged
laskoviymishka merged 5 commits into
apache:mainfrom
dhananjaykrutika:drop-deprecated-distinct-counts
May 21, 2026
Merged

fix(manifest): stop writing deprecated distinct_counts (field-id 111)#1102
laskoviymishka merged 5 commits into
apache:mainfrom
dhananjaykrutika:drop-deprecated-distinct-counts

Conversation

@dhananjaykrutika
Copy link
Copy Markdown
Contributor

@dhananjaykrutika dhananjaykrutika commented May 20, 2026

The Avro wire schema declared distinct_counts on data_file v1 and v2, causing it to be emitted on every manifest entry. The Iceberg spec marks this field as "Deprecated. Do not write." (https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L667), so writers should not emit it. The field stands deprecated since 2020 -- apache/iceberg#767 (comment)

Strict readers that don't include field-id 111 in their data_file read schema (e.g. PyIceberg 0.10) fail to resolve manifests written by this library with: ResolveError "File/read schema are not aligned for map, got None".

@dhananjaykrutika
Copy link
Copy Markdown
Contributor Author

The integration test failure comes from PR #355. The change at #1105 fixes this. So this test should pass once #1105 is merged.

@twuebi
Copy link
Copy Markdown
Contributor

twuebi commented May 21, 2026

cc @laskoviymishka

Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, narrow fix, dropping field 111 from the v1/v2 Avro schemas is the right enforcement layer, and the PyIceberg 0.10 unblock is a very real reason to do it.

I’d hold merge for a couple of cleanup points though. Main one: removing TestWriteManifestV{1,2}KeepsDistinctCounts entirely feels a bit overkill :D

We no longer want to write distinct_counts, but we should still be able to read legacy manifests that already have field 111 on the wire. The dataFile struct still has the Avro tag, so older iceberg-go / Spark-written manifests should decode fine — but after deleting those tests, nothing pins that behavior.

I’d add a small back-compat test that bypasses WriteManifest, builds/uses a raw OCF fixture with the old schema and field 111 present, then reads it through ReadManifest and asserts the map comes back.

A few smaller things before merge:

  • update MarshalAvroEntry / EncodeDataFile godoc: distinct_counts is dropped on encode for all versions now, not just v3
  • rename/broaden TestWriteManifestV3OmitsDistinctCounts to cover v1/v2/v3
  • either remove the v3-only prepareEntry special case, or mirror the nil-clear for v1/v2 too
  • add // Deprecated: godoc on DataFileBuilder.DistinctValueCounts, since it’s now a public setter that won’t encode

Fix itself looks right — I’d just keep the legacy read coverage instead of deleting the whole old test shape.

Krutika Dhananjay added 4 commits May 21, 2026 12:13
The Avro wire schema declared distinct_counts on data_file v1 and v2,
causing it to be emitted on every manifest entry. The Iceberg spec
marks this field as "Deprecated. Do not write."
(apache/iceberg format/spec.md), so writers should not emit it.

Strict readers that don't include field-id 111 in their data_file read
schema (e.g. PyIceberg 0.10) fail to resolve manifests written by this
library with: ResolveError "File/read schema are not aligned for map,
got None".
PR apache#1075's TestEncodeDecodeDataFileRoundTrip fixture populated
distinct_counts on v1/v2 and asserted the field round-tripped, on
the premise that v1/v2 manifest-entry schemas declared field 111.

Update the test to match: populate DistinctValueCounts on every
version's fixture and assert it is empty after round-trip on every
version. The assertion now serves as a regression guard for the
intended behavior -- manifests written by this library never carry
field 111, regardless of what the source DataFile holds in memory.
@zeroshade zeroshade force-pushed the drop-deprecated-distinct-counts branch from 6deb071 to 853eb32 Compare May 21, 2026 16:13
@zeroshade
Copy link
Copy Markdown
Member

@dhananjaykrutika I hope you don't mind that I'm gonna address @laskoviymishka's comments so we can get this across the finish line so that we can create a new RC as this issue is causing us to have an incompatibility with pyiceberg

@laskoviymishka laskoviymishka self-requested a review May 21, 2026 16:50
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 🫰

@laskoviymishka laskoviymishka merged commit 51b3140 into apache:main May 21, 2026
5 of 14 checks passed
zeroshade added a commit that referenced this pull request May 21, 2026
… v18.6.0 (#1114)

Restores main CI to green by reverting only the test-expectation byte
counts that #1102 ("fix(manifest): stop writing deprecated
distinct_counts") inadvertently changed.

## What broke

After #1102 merged to main, both the `Go` and `Audit and Verify`
workflows started failing on every push, across all 6 matrix entries
(ubuntu/windows/macos × Go 1.25.5/1.26.1):

```
--- FAIL: TestTableWriting (10.16s)
    --- FAIL: TestTableWriting/TestAddFilesPartitionedTable
    --- FAIL: TestTableWriting/TestAddFilesUnpartitioned
    --- FAIL: TestTableWriting/TestReplaceDataFiles
    --- FAIL: TestTableWriting/TestAddFilesPartitionedTable#01
    --- FAIL: TestTableWriting/TestAddFilesUnpartitioned#01
    --- FAIL: TestTableWriting/TestReplaceDataFiles#01
--- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch (0.00s)
    --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/success
    --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/batch_with_records_having_different_file_paths
```

with diffs like:

```
- "added-files-size": "3590"   (actual on CI / arrow-go v18.6.0)
+ "added-files-size": "3070"   (test source, post-#1102)

- ColumnSizes: 2147483545:88   (actual on CI / arrow-go v18.6.0)
+ ColumnSizes: 2147483545:86   (test source, post-#1102)
```

## Root cause

These two test files assert on **exact byte counts** of the on-disk
parquet files written by `arrow-go`. Those byte counts depend on the
parquet writer's metadata encoding, which differs across arrow-go
versions.

#1102 lowered the expected counts (3590→3070, 1066→963, 1816→2132+→1816,
4264→3687, 88→86, 174→172, 96→94, 187→185). Those new values only
reproduce against an unreleased arrow-go build (presumably wired in via
a local `go.work`). Against the pinned `github.com/apache/arrow-go/v18
v18.6.0` from `go.sum` — which every CI runner resolves — the writer
still emits the original larger sizes, so the assertions fail every
time.

The PR's own CI runs were `CANCELLED` before completion, so this didn't
surface at merge time.

The functional change in #1102 (dropping `distinct_counts` field-id 111
from the manifest entry Avro schema) only alters **manifest**
serialization, not the **data** parquet files whose sizes these
`Summary`/`ColumnSizes` fields tally — so the test-value updates were
always unrelated to the production fix and can be safely reverted on
their own.

## Fix

Revert only the eight byte-count lines back to the pre-#1102 values
verified via `git show 51b3140^`:

| File | Test | Field | Before | #1102 | This PR |
|---|---|---|---|---|---|
| `table/table_test.go:549,554` | `TestAddFilesUnpartitioned` |
`added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:770,776` | `TestAddFilesPartitionedTable` |
`added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:1136,1140,1144` | `TestReplaceDataFiles` |
`added-/removed-/total-files-size` | 1066/2132/4264 | 963/1816/3687 |
**1066/2132/4264** |
| `table/pos_delete_partitioned_fanout_writer_test.go:77` | `success` |
`ColumnSizes` | 88,174 | 86,172 | **88,174** |
| `table/pos_delete_partitioned_fanout_writer_test.go:87` |
`batch_with_records_having_different_file_paths` | `ColumnSizes` |
96,187 | 94,185 | **96,187** |

No production code changes — #1102's manifest fix is preserved intact.

## Verification

Reproduced CI failure locally with `GOWORK=off go test ./...` (forces
use of pinned v18.6.0):

- **Before this PR**: same 8 sub-test failures with identical
expected/actual diffs as CI.
- **After this PR**: `ok` across every package — `iceberg-go`,
`catalog/{glue,hadoop,hive,rest,sql}`, `cmd/iceberg`, `codec`, `config`,
`internal`, `io`, `io/gocloud`, `puffin`, `table`,
`table/{compaction,dv,internal,substrait}`, `view`, `view/internal`.
- `go vet ./...` clean, `go build ./...` clean, LSP diagnostics clean on
both touched files.

## Followup (out of scope)

These tests are inherently brittle — they'll keep breaking on every
arrow-go bump that nudges parquet metadata encoding. A future cleanup
could replace exact byte assertions with bounds (`> 0`, monotonic
relationships between added/removed/total) or assert on row counts only.
Not addressing here to keep the diff minimal and unblock main.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants