feat(table): project reserved row-lineage fields as null when the file lacks them by happydave1 · Pull Request #1045 · apache/iceberg-go

happydave1 · 2026-05-09T16:51:50Z

laskoviymishka

Right shape of fix, deferring missing reserved metadata columns and re-attaching them after PruneColumns is a clean way to avoid pushing a “synthesize on missing” flag into the pruner.

A few things I’d fix before landing:

The append path runs even when PruneColumns returns (nil, err), so that can nil-deref. Needs if err != nil { return nil, err } before continuing.
This needs a format-version gate. On v1/v2 tables, Select(true, "_row_id") used to error; now it silently succeeds and injects a synthetic null field. Java gates this on v3 / row-lineage support, and PyIceberg still raises on missing names.
isRowLineageMeta lowercases unconditionally, so Select(caseSensitive=true, "_Row_ID") now succeeds. Case-sensitive mode should reject that.
The final NewSchema(prunedSchema.ID, ...) drops IdentifierFieldIDs. PruneColumns keeps them correctly, but the wrapper throws them away — should use NewSchemaWithIdentifiers.

Bigger-picture: I’d consider keeping Schema.Select as a pure structural prune, and doing the synthetic metadata-column join in Scan.Projection(), where the table format version is already in scope. Not blocking, but it keeps this concern out of unrelated Select callers.

Tests I’d add: case-sensitive negative case, v1/v2 negative case, and partial-presence case where one metadata column is already present and the other is synthesized.

laskoviymishka · 2026-05-10T20:24:02Z

+		return prunedSchema, err
+	}
+
+	return NewSchema(prunedSchema.ID, append(prunedSchema.fields, missingMetaFields...)...), nil


Several concerns concentrated on this final return:

Blocking — nil-deref. When PruneColumns returns (nil, err) and missingMetaFields is non-empty, the early-return guard above is skipped and prunedSchema.ID / prunedSchema.fields panic. Need an if err != nil { return nil, err } before this line.

Drops IdentifierFieldIDs. NewSchema calls NewSchemaWithIdentifiers(id, []int{}, ...). PruneColumns filters identifiers correctly and the fast path returns them; this slow path discards them. Use NewSchemaWithIdentifiers(prunedSchema.ID, prunedSchema.IdentifierFieldIDs, ...).

Latent panic on duplicate names. If a caller passes Select(true, "_row_id", "_row_id"), missingMetaFields ends up with two RowID() entries (same field ID), and checkDuplicateFieldIDs inside NewSchemaWithIdentifiers panics. Select's doc says it returns errors — would dedupe missingMetaFields (or detect duplicates earlier and return ErrInvalidSchema).

Aliasing. append(prunedSchema.fields, ...) writes through the unexported slice. prunedSchema is fresh from PruneColumns so cap usually equals len, but if pruning ever pre-allocates extra cap, the append aliases. slices.Clone(prunedSchema.fields) makes the no-aliasing intent explicit.

Suggested shape:

prunedSchema, err := PruneColumns(s, ids, true) if err != nil { return nil, err } if len(missingMetaFields) == 0 { return prunedSchema, nil } return NewSchemaWithIdentifiers( prunedSchema.ID, prunedSchema.IdentifierFieldIDs, append(slices.Clone(prunedSchema.fields), missingMetaFields...)..., ), nil

laskoviymishka · 2026-05-10T20:24:02Z

 // then fields will be identified by case insensitive search.
 //
 // An error is returned if a requested name cannot be found.
 func (s *Schema) Select(caseSensitive bool, names ...string) (*Schema, error) {


Architecturally, would consider keeping Schema.Select as a pure structural prune and moving the row-lineage synthesis up into Scan.Projection() (in table/scanner.go).

Reasons:

Java keeps TypeUtil.select pure and joins the metadata columns in the scan builder (BaseSparkScanBuilder.projectionWithMetadataColumns); PyIceberg's Schema.select raises on missing names unconditionally and synthesizes meta columns at the scan layer. This PR diverges from both.

Schema.Select is called outside scan paths too (e.g. equality-delete writers). Teaching it about exactly two reserved names couples the schema utility to scan-layer concerns.

The version gate is naturally available in Projection() via metadata.Version(), which gives you the v3 check for free (see comment on the isRowLineageMeta call below).

Not blocking the change, but worth considering before this surface gets used by more callers.

Also: the doc comment two lines up still says "An error is returned if a requested name cannot be found." — that's no longer true for _row_id / _last_updated_sequence_number. Either move the logic or update the doc.

laskoviymishka · 2026-05-10T20:24:02Z

 		for _, n := range names {
 			id, ok := nameMap[n]
 			if !ok {
+				// check if v3 row lineage metadata


Comment says v3 but there's no version check anywhere in this path. On a v1 or v2 table, Select(_, "_row_id") previously returned ErrInvalidSchema; with this PR it silently succeeds and injects a synthetic null field — a contract change for older tables.

Java gates this on formatVersion >= 3 via TableUtil.supportsRowLineage, and Spark's scan builder only calls addLineageIfRequired when that check passes. Recommend either threading the format version into Select (ugly) or moving the synthesis into Scan.Projection() where metadata.Version() is in scope (clean — see other comment).

laskoviymishka · 2026-05-10T20:24:02Z

 }
+
+func isRowLineageMeta(name string) (NestedField, bool) {
+	name = strings.ToLower(name)


Unconditional strings.ToLower short-circuits the case-sensitive branch in Schema.Select. A caller passing Select(true, "_Row_ID") should hit ErrInvalidSchema like any other miscased column; today it silently succeeds and gets a synthetic null _row_id field projected.

Would take a caseSensitive bool parameter and only fold when false:

func rowLineageMetaField(name string, caseSensitive bool) (NestedField, bool) { if !caseSensitive { name = strings.ToLower(name) } switch name { case RowIDColumnName: return RowID(), true case LastUpdatedSequenceNumberColumnName: return LastUpdatedSequenceNumber(), true } return NestedField{}, false }

(The constants are already lowercase, so the case-sensitive branch can compare directly.)

Good catch! will fix this!

laskoviymishka · 2026-05-10T20:24:02Z

 	return fieldID == RowIDFieldID || fieldID == LastUpdatedSequenceNumberFieldID
 }
+
+func isRowLineageMeta(name string) (NestedField, bool) {


Nit: isRowLineageMeta reads as a pure predicate but returns (NestedField, bool). Conventional Go names this rowLineageMetaField or findRowLineageMeta — matches the findX (T, bool) convention and reads better at the call sites in schema.go.

laskoviymishka · 2026-05-10T20:24:02Z

+	)
+	require.NoError(t, err)
+	assert.Equal(t, 3, metadata.Version(), "sanity: must be v3")
+	assert.GreaterOrEqual(t, metadata.NextRowID(), int64(0), "sanity: next-row-id must be set")


Nit: >= 0 accepts 0, but the message says "next-row-id must be set." If the intent is just "non-negative," the message is fine reworded; if the intent is "actually initialized to a meaningful value," the assertion is too weak (a freshly-created v3 table can legitimately start at 0).

laskoviymishka · 2026-05-10T20:24:02Z

+// _row_id and _last_updated_sequence_number as nullable (all-null-capable) fields when
+// the table is v3 with next-row-id set but the data file predates row lineage (those
+// columns are absent from the schema).
+func TestProjectionV3PreLineageFile(t *testing.T) {


Test coverage is happy-path-only. Consider adding subtests for:

caseSensitive: false — the case-insensitive branch in Select got new code and is untested.

v1/v2 negative case — assert that Select(_, "_row_id") still returns ErrInvalidSchema on a v1/v2 schema (locks in the version gate once added).

Partial presence — schema with _row_id already in it but _last_updated_sequence_number absent. This is also where the duplicate-ID panic in NewSchemaWithIdentifiers would surface if _row_id ends up in both ids (via the name map) and missingMetaFields.

Panic-path coverage — a schema/ids combo that makes PruneColumns return (nil, err) while missingMetaFields is non-empty (the blocking nil-deref above).

A direct Schema.Select unit test in the iceberg package (schema_test.go) would also give clearer attribution than the end-to-end Projection() test, especially if the synthesis later moves out of Select.

Signed-off-by: happydave1 <dzhao2004@gmail.com>

laskoviymishka

Big restructure since the last round — and I think it’s the right direction. Moving the synthesis into Scan.Projection() and keeping Schema.Select as a pure prune is much cleaner. The v3 gate, case-sensitivity handling, and expanded tests also address the earlier concerns well.

I’d hold once more though.

The inline comment at schemaContainsMeta points to what looks like a real silent data-loss bug for any v3 table that legitimately declares _row_id. The check looks up the reserved field ID, but reassignIDs strips reserved IDs during NewMetadata, so detection always returns false. That means the real column gets stripped from the select list and replaced with synthetic null.

The new partial-presence tests look like they cover this, but I think they accidentally go through the synthesis path too — the rowIDField.ID == RowIDFieldID assertion is trivially true for the synthesized field. Once this is reworked, it’d be good to have a test that proves the user’s _row_id column actually survives, either by asserting on the field type or by reading data through it, not just checking the ID.

One other issue carries over from the previous round: the synthesis return still drops IdentifierFieldIDs.

Once those two are fixed, happy to take another pass and approve.

laskoviymishka · 2026-05-13T19:52:45Z

+		return false
+	}
+
+	_, hasRowIdMeta := schema.FindFieldByID(iceberg.RowIDFieldID)


I think there's a subtle issue here. FindFieldByID(iceberg.RowIDFieldID) basically never finds anything once a schema goes through NewMetadata — reassignIDs calls AssignFreshSchemaIDs (schema.go:1366), which renumbers every field starting from 1, so a schema passed in with iceberg.RowID() ends up with _row_id at id=3, not at MaxInt32-107.

The practical consequence on any v3 table that legitimately declares _row_id in its schema: this returns false → synthesis path is taken → removeMetadataFromSelectedFields strips _row_id from the select list → a synthetic all-null _row_id is appended at the reserved id. The user's real _row_id column is silently dropped from the projection.

The …SchemaWithRowIDOnly / …WithLastUpdatedSequenceNumberOnly tests look like they cover this case but actually go down the synthesis path too — their rowIDField.ID == RowIDFieldID assertion is trivially true for the synthesized field, so they pass without proving the user's column survived.

A couple of ways to handle: look up by name (FindFieldByName(iceberg.RowIDColumnName)) instead of reserved id, or have removeMetadataFromSelectedFields skip the strip when the schema already has a field with that name. wdyt?

Nice catch! I did not know that fields get reassembled whenever a schema goes through NewMetadata!

I think the clean fix would be to just use FindFieldByName instead.

laskoviymishka · 2026-05-13T19:52:45Z

+			return nil, err
+		}
+
+		return iceberg.NewSchema(sch.ID, append(sch.Fields(), missingMetaFields...)...), nil


Carrying over from the previous round — iceberg.NewSchema calls NewSchemaWithIdentifiers(id, []int{}, ...), so IdentifierFieldIDs get dropped here. sch from Select carries them correctly; the wrap throws them away. NewSchemaWithIdentifiers(sch.ID, sch.IdentifierFieldIDs, append(sch.Fields(), missingMetaFields...)...) would preserve them.

Good catch - I totally forgot about this fix because of the restructure. Thanks for reminding!

Signed-off-by: happydave1 <dzhao2004@gmail.com>

happydave1 · 2026-05-18T20:53:52Z

@laskoviymishka Thanks for the previous review - the issues spotted were definitely good catches. Also thanks for the clear and precise communication - it made fixing the issues a lot easier.

Another restructure since last time.

I found that once I fixed the bug caused by using FindFieldByName(), the partial cases were failing since all row lineage metadata was being stripped from selectedFields, thus leading to lost data - Good catch!

So this round I focused on making sure only metadata which needed to be synthesized is properly synthesized. Now the synthesis path (scanner.go:271) is only gated by the version and whether or not any metadata has to be synthesized. I think that this approach is much cleaner and clearer.

I also fixed the carry-over issue of sch.IndentifierFieldIDs being dropped by using iceberg.NewSchemaWithIdentifiers.

Please let me know what you think, any feedback would be greatly appreciated. Thanks!

laskoviymishka

Nice work.

Set-difference framing reads clean, FindFieldByName correctly survives the AssignFreshSchemaIDs renumbering, NewSchemaWithIdentifiers preserves identifier ids, and the new NotEqual(RowIDFieldID, …) assertion would have caught the round-2 data-loss regression.

LGTM.

Left one inline, not blocking, just so the divergence stays traceable, let's file an issue.

laskoviymishka · 2026-05-18T21:35:07Z

+
+	rowIDField, ok := fieldByName[iceberg.RowIDColumnName]
+	require.True(t, ok)
+	assert.NotEqual(t, iceberg.RowIDFieldID, rowIDField.ID) // NewMetadata reorders schema field numbers


The comment captures the symptom, but the root cause is upstream — AssignFreshSchemaIDs (schema.go:1366) renumbers reserved metadata column IDs along with everything else. Concrete consequence outside this PR's scope: when a v3 table's _row_id ends up at id=3 in the projection, arrowAccessor.FieldPartner (table/arrow_utils.go:677) will look up id=3 in file schemas — Java-written files have _row_id at id=MaxInt32-107, so we'd silently read whichever column the Java file has at id=3 instead.

Not asking to fix in this PR — would you mind filing a follow-up issue so the divergence is traceable? Either special-casing reserved ids in AssignFreshSchemaIDs or having NewMetadata reject metadata columns in the user schema (matching Java) are both reasonable directions to discuss there.

happydave1 marked this pull request as ready for review May 9, 2026 17:03

happydave1 requested a review from zeroshade as a code owner May 9, 2026 17:03

laskoviymishka requested changes May 10, 2026

View reviewed changes

happydave1 added 3 commits May 13, 2026 01:21

projecting row lineage fields as null

404a74f

Signed-off-by: happydave1 <dzhao2004@gmail.com>

fix lint

4cc3f9b

Signed-off-by: happydave1 <dzhao2004@gmail.com>

updating structure of row lineage synthesis

d3fb3c4

Signed-off-by: happydave1 <dzhao2004@gmail.com>

happydave1 force-pushed the feat/row-lineage branch from 7f90277 to d3fb3c4 Compare May 13, 2026 08:26

happydave1 added 2 commits May 13, 2026 01:26

fix lint

c0f829f

Signed-off-by: happydave1 <dzhao2004@gmail.com>

adding Select unit tests

ac6af8d

Signed-off-by: happydave1 <dzhao2004@gmail.com>

happydave1 requested a review from laskoviymishka May 13, 2026 17:00

laskoviymishka requested changes May 13, 2026

View reviewed changes

updating structure

f072a7a

Signed-off-by: happydave1 <dzhao2004@gmail.com>

laskoviymishka approved these changes May 18, 2026

View reviewed changes

laskoviymishka merged commit 1a44611 into apache:main May 18, 2026
14 checks passed

Conversation

happydave1 commented May 9, 2026

Uh oh!

laskoviymishka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

laskoviymishka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

happydave1 commented May 18, 2026

Uh oh!

laskoviymishka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants