Skip to content

feat(table): project reserved row-lineage fields as null when the file lacks them#1045

Merged
laskoviymishka merged 6 commits into
apache:mainfrom
happydave1:feat/row-lineage
May 18, 2026
Merged

feat(table): project reserved row-lineage fields as null when the file lacks them#1045
laskoviymishka merged 6 commits into
apache:mainfrom
happydave1:feat/row-lineage

Conversation

@happydave1
Copy link
Copy Markdown
Contributor

fixes #1010.

@happydave1 happydave1 marked this pull request as ready for review May 9, 2026 17:03
@happydave1 happydave1 requested a review from zeroshade as a code owner May 9, 2026 17:03
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right shape of fix, deferring missing reserved metadata columns and re-attaching them after PruneColumns is a clean way to avoid pushing a “synthesize on missing” flag into the pruner.

A few things I’d fix before landing:

  1. The append path runs even when PruneColumns returns (nil, err), so that can nil-deref. Needs if err != nil { return nil, err } before continuing.
  2. This needs a format-version gate. On v1/v2 tables, Select(true, "_row_id") used to error; now it silently succeeds and injects a synthetic null field. Java gates this on v3 / row-lineage support, and PyIceberg still raises on missing names.
  3. isRowLineageMeta lowercases unconditionally, so Select(caseSensitive=true, "_Row_ID") now succeeds. Case-sensitive mode should reject that.
  4. The final NewSchema(prunedSchema.ID, ...) drops IdentifierFieldIDs. PruneColumns keeps them correctly, but the wrapper throws them away — should use NewSchemaWithIdentifiers.

Bigger-picture: I’d consider keeping Schema.Select as a pure structural prune, and doing the synthetic metadata-column join in Scan.Projection(), where the table format version is already in scope. Not blocking, but it keeps this concern out of unrelated Select callers.

Tests I’d add: case-sensitive negative case, v1/v2 negative case, and partial-presence case where one metadata column is already present and the other is synthesized.

Comment thread schema.go Outdated
return prunedSchema, err
}

return NewSchema(prunedSchema.ID, append(prunedSchema.fields, missingMetaFields...)...), nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several concerns concentrated on this final return:

Blocking — nil-deref. When PruneColumns returns (nil, err) and missingMetaFields is non-empty, the early-return guard above is skipped and prunedSchema.ID / prunedSchema.fields panic. Need an if err != nil { return nil, err } before this line.

Drops IdentifierFieldIDs. NewSchema calls NewSchemaWithIdentifiers(id, []int{}, ...). PruneColumns filters identifiers correctly and the fast path returns them; this slow path discards them. Use NewSchemaWithIdentifiers(prunedSchema.ID, prunedSchema.IdentifierFieldIDs, ...).

Latent panic on duplicate names. If a caller passes Select(true, "_row_id", "_row_id"), missingMetaFields ends up with two RowID() entries (same field ID), and checkDuplicateFieldIDs inside NewSchemaWithIdentifiers panics. Select's doc says it returns errors — would dedupe missingMetaFields (or detect duplicates earlier and return ErrInvalidSchema).

Aliasing. append(prunedSchema.fields, ...) writes through the unexported slice. prunedSchema is fresh from PruneColumns so cap usually equals len, but if pruning ever pre-allocates extra cap, the append aliases. slices.Clone(prunedSchema.fields) makes the no-aliasing intent explicit.

Suggested shape:

prunedSchema, err := PruneColumns(s, ids, true)
if err != nil {
    return nil, err
}
if len(missingMetaFields) == 0 {
    return prunedSchema, nil
}
return NewSchemaWithIdentifiers(
    prunedSchema.ID,
    prunedSchema.IdentifierFieldIDs,
    append(slices.Clone(prunedSchema.fields), missingMetaFields...)...,
), nil

Comment thread schema.go
// then fields will be identified by case insensitive search.
//
// An error is returned if a requested name cannot be found.
func (s *Schema) Select(caseSensitive bool, names ...string) (*Schema, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecturally, would consider keeping Schema.Select as a pure structural prune and moving the row-lineage synthesis up into Scan.Projection() (in table/scanner.go).

Reasons:

  • Java keeps TypeUtil.select pure and joins the metadata columns in the scan builder (BaseSparkScanBuilder.projectionWithMetadataColumns); PyIceberg's Schema.select raises on missing names unconditionally and synthesizes meta columns at the scan layer. This PR diverges from both.
  • Schema.Select is called outside scan paths too (e.g. equality-delete writers). Teaching it about exactly two reserved names couples the schema utility to scan-layer concerns.
  • The version gate is naturally available in Projection() via metadata.Version(), which gives you the v3 check for free (see comment on the isRowLineageMeta call below).

Not blocking the change, but worth considering before this surface gets used by more callers.

Also: the doc comment two lines up still says "An error is returned if a requested name cannot be found." — that's no longer true for _row_id / _last_updated_sequence_number. Either move the logic or update the doc.

Comment thread schema.go Outdated
for _, n := range names {
id, ok := nameMap[n]
if !ok {
// check if v3 row lineage metadata
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment says v3 but there's no version check anywhere in this path. On a v1 or v2 table, Select(_, "_row_id") previously returned ErrInvalidSchema; with this PR it silently succeeds and injects a synthetic null field — a contract change for older tables.

Java gates this on formatVersion >= 3 via TableUtil.supportsRowLineage, and Spark's scan builder only calls addLineageIfRequired when that check passes. Recommend either threading the format version into Select (ugly) or moving the synthesis into Scan.Projection() where metadata.Version() is in scope (clean — see other comment).

Comment thread metadata_columns.go Outdated
}

func isRowLineageMeta(name string) (NestedField, bool) {
name = strings.ToLower(name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unconditional strings.ToLower short-circuits the case-sensitive branch in Schema.Select. A caller passing Select(true, "_Row_ID") should hit ErrInvalidSchema like any other miscased column; today it silently succeeds and gets a synthetic null _row_id field projected.

Would take a caseSensitive bool parameter and only fold when false:

func rowLineageMetaField(name string, caseSensitive bool) (NestedField, bool) {
    if !caseSensitive {
        name = strings.ToLower(name)
    }
    switch name {
    case RowIDColumnName:
        return RowID(), true
    case LastUpdatedSequenceNumberColumnName:
        return LastUpdatedSequenceNumber(), true
    }
    return NestedField{}, false
}

(The constants are already lowercase, so the case-sensitive branch can compare directly.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! will fix this!

Comment thread metadata_columns.go Outdated
return fieldID == RowIDFieldID || fieldID == LastUpdatedSequenceNumberFieldID
}

func isRowLineageMeta(name string) (NestedField, bool) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: isRowLineageMeta reads as a pure predicate but returns (NestedField, bool). Conventional Go names this rowLineageMetaField or findRowLineageMeta — matches the findX (T, bool) convention and reads better at the call sites in schema.go.

Comment thread table/scanner_internal_test.go Outdated
)
require.NoError(t, err)
assert.Equal(t, 3, metadata.Version(), "sanity: must be v3")
assert.GreaterOrEqual(t, metadata.NextRowID(), int64(0), "sanity: next-row-id must be set")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: >= 0 accepts 0, but the message says "next-row-id must be set." If the intent is just "non-negative," the message is fine reworded; if the intent is "actually initialized to a meaningful value," the assertion is too weak (a freshly-created v3 table can legitimately start at 0).

// _row_id and _last_updated_sequence_number as nullable (all-null-capable) fields when
// the table is v3 with next-row-id set but the data file predates row lineage (those
// columns are absent from the schema).
func TestProjectionV3PreLineageFile(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage is happy-path-only. Consider adding subtests for:

  1. caseSensitive: false — the case-insensitive branch in Select got new code and is untested.
  2. v1/v2 negative case — assert that Select(_, "_row_id") still returns ErrInvalidSchema on a v1/v2 schema (locks in the version gate once added).
  3. Partial presence — schema with _row_id already in it but _last_updated_sequence_number absent. This is also where the duplicate-ID panic in NewSchemaWithIdentifiers would surface if _row_id ends up in both ids (via the name map) and missingMetaFields.
  4. Panic-path coverage — a schema/ids combo that makes PruneColumns return (nil, err) while missingMetaFields is non-empty (the blocking nil-deref above).

A direct Schema.Select unit test in the iceberg package (schema_test.go) would also give clearer attribution than the end-to-end Projection() test, especially if the synthesis later moves out of Select.

Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
@happydave1 happydave1 requested a review from laskoviymishka May 13, 2026 17:00
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big restructure since the last round — and I think it’s the right direction. Moving the synthesis into Scan.Projection() and keeping Schema.Select as a pure prune is much cleaner. The v3 gate, case-sensitivity handling, and expanded tests also address the earlier concerns well.

I’d hold once more though.

The inline comment at schemaContainsMeta points to what looks like a real silent data-loss bug for any v3 table that legitimately declares _row_id. The check looks up the reserved field ID, but reassignIDs strips reserved IDs during NewMetadata, so detection always returns false. That means the real column gets stripped from the select list and replaced with synthetic null.

The new partial-presence tests look like they cover this, but I think they accidentally go through the synthesis path too — the rowIDField.ID == RowIDFieldID assertion is trivially true for the synthesized field. Once this is reworked, it’d be good to have a test that proves the user’s _row_id column actually survives, either by asserting on the field type or by reading data through it, not just checking the ID.

One other issue carries over from the previous round: the synthesis return still drops IdentifierFieldIDs.

Once those two are fixed, happy to take another pass and approve.

Comment thread table/scanner.go Outdated
return false
}

_, hasRowIdMeta := schema.FindFieldByID(iceberg.RowIDFieldID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a subtle issue here. FindFieldByID(iceberg.RowIDFieldID) basically never finds anything once a schema goes through NewMetadatareassignIDs calls AssignFreshSchemaIDs (schema.go:1366), which renumbers every field starting from 1, so a schema passed in with iceberg.RowID() ends up with _row_id at id=3, not at MaxInt32-107.

The practical consequence on any v3 table that legitimately declares _row_id in its schema: this returns false → synthesis path is taken → removeMetadataFromSelectedFields strips _row_id from the select list → a synthetic all-null _row_id is appended at the reserved id. The user's real _row_id column is silently dropped from the projection.

The …SchemaWithRowIDOnly / …WithLastUpdatedSequenceNumberOnly tests look like they cover this case but actually go down the synthesis path too — their rowIDField.ID == RowIDFieldID assertion is trivially true for the synthesized field, so they pass without proving the user's column survived.

A couple of ways to handle: look up by name (FindFieldByName(iceberg.RowIDColumnName)) instead of reserved id, or have removeMetadataFromSelectedFields skip the strip when the schema already has a field with that name. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I did not know that fields get reassembled whenever a schema goes through NewMetadata!

I think the clean fix would be to just use FindFieldByName instead.

Comment thread table/scanner.go Outdated
return nil, err
}

return iceberg.NewSchema(sch.ID, append(sch.Fields(), missingMetaFields...)...), nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carrying over from the previous round — iceberg.NewSchema calls NewSchemaWithIdentifiers(id, []int{}, ...), so IdentifierFieldIDs get dropped here. sch from Select carries them correctly; the wrap throws them away. NewSchemaWithIdentifiers(sch.ID, sch.IdentifierFieldIDs, append(sch.Fields(), missingMetaFields...)...) would preserve them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I totally forgot about this fix because of the restructure. Thanks for reminding!

Signed-off-by: happydave1 <dzhao2004@gmail.com>
@happydave1
Copy link
Copy Markdown
Contributor Author

@laskoviymishka Thanks for the previous review - the issues spotted were definitely good catches. Also thanks for the clear and precise communication - it made fixing the issues a lot easier.

Another restructure since last time.

I found that once I fixed the bug caused by using FindFieldByName(), the partial cases were failing since all row lineage metadata was being stripped from selectedFields, thus leading to lost data - Good catch!

So this round I focused on making sure only metadata which needed to be synthesized is properly synthesized. Now the synthesis path (scanner.go:271) is only gated by the version and whether or not any metadata has to be synthesized. I think that this approach is much cleaner and clearer.

I also fixed the carry-over issue of sch.IndentifierFieldIDs being dropped by using iceberg.NewSchemaWithIdentifiers.

Please let me know what you think, any feedback would be greatly appreciated. Thanks!

Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

Set-difference framing reads clean, FindFieldByName correctly survives the AssignFreshSchemaIDs renumbering, NewSchemaWithIdentifiers preserves identifier ids, and the new NotEqual(RowIDFieldID, …) assertion would have caught the round-2 data-loss regression.

LGTM.

Left one inline, not blocking, just so the divergence stays traceable, let's file an issue.


rowIDField, ok := fieldByName[iceberg.RowIDColumnName]
require.True(t, ok)
assert.NotEqual(t, iceberg.RowIDFieldID, rowIDField.ID) // NewMetadata reorders schema field numbers
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment captures the symptom, but the root cause is upstream — AssignFreshSchemaIDs (schema.go:1366) renumbers reserved metadata column IDs along with everything else. Concrete consequence outside this PR's scope: when a v3 table's _row_id ends up at id=3 in the projection, arrowAccessor.FieldPartner (table/arrow_utils.go:677) will look up id=3 in file schemas — Java-written files have _row_id at id=MaxInt32-107, so we'd silently read whichever column the Java file has at id=3 instead.

Not asking to fix in this PR — would you mind filing a follow-up issue so the divergence is traceable? Either special-casing reserved ids in AssignFreshSchemaIDs or having NewMetadata reject metadata columns in the user schema (matching Java) are both reasonable directions to discuss there.

@laskoviymishka laskoviymishka merged commit 1a44611 into apache:main May 18, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(table): project reserved row-lineage fields as null when the file lacks them

2 participants