
Iceberg changelog scan returns wrong data after column rename / drop-then-add #2375

Description

@lyne7-sc

Describe the bug

Auron already resolves Iceberg columns by field-id in the regular native Iceberg scan path, which makes top-level schema evolution such as column rename and drop-then-add safe for Parquet files.

The newer insert-only Iceberg changelog scan path reuses the native file reader, but it does not pass the same Iceberg field-id mapping into the native scan plan yet.
As a result, native Parquet schema matching falls back to column names on the changelog path.

This can return wrong data after schema evolution:

  • after RENAME COLUMN, pre-rename changelog files may read as null;
  • after DROP + ADD of the same name, the new column may read data from the old dropped column.
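To make the two failure modes concrete, here is a minimal Python sketch of name-based versus field-id-based column resolution. It is not Auron's code: `Field`, `resolve_by_name`, `resolve_by_field_id`, and the field IDs 1, 2, 3 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a column: an Iceberg field ID plus a name."""
    field_id: int
    name: str


def resolve_by_name(file_schema: List[Field], requested: Field) -> Optional[Field]:
    # Name-only matching, i.e. the fallback described above.
    return next((f for f in file_schema if f.name == requested.name), None)


def resolve_by_field_id(file_schema: List[Field], requested: Field) -> Optional[Field]:
    # Field-id matching; a miss means the column is absent from the file
    # and should be read as null.
    return next((f for f in file_schema if f.field_id == requested.field_id), None)


# Case 1: file written before RENAME COLUMN old_name -> new_name.
# The rename keeps the field ID (assumed to be 2 here) and changes only the name.
pre_rename_file = [Field(1, "id"), Field(2, "old_name")]
renamed = Field(2, "new_name")

rename_by_name = resolve_by_name(pre_rename_file, renamed)    # no match: reads null
rename_by_id = resolve_by_field_id(pre_rename_file, renamed)  # matches old_name

# Case 2: file written before DROP COLUMN value + ADD COLUMN value.
# The re-added column gets a fresh field ID (assumed to be 3 here).
pre_drop_file = [Field(1, "id"), Field(2, "value")]
readded = Field(3, "value")

drop_add_by_name = resolve_by_name(pre_drop_file, readded)    # matches dropped column
drop_add_by_id = resolve_by_field_id(pre_drop_file, readded)  # no match: reads null
```

In this sketch, name matching misses the renamed column and wrongly hits the dropped one, while field-id matching gives the Spark/Iceberg behavior in both cases.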

To Reproduce

create table local.db.t_changelog_rename (id int, old_name string)
using iceberg
tblproperties ('format-version' = '2');

insert into local.db.t_changelog_rename values (0, 'initial');
-- record start snapshot
insert into local.db.t_changelog_rename values (1, 'before');
alter table local.db.t_changelog_rename rename column old_name to new_name;
insert into local.db.t_changelog_rename values (2, 'after');
-- record end snapshot

CALL local.system.create_changelog_view(
  table => 'db.t_changelog_rename',
  changelog_view => 't_changelog_rename_changes',
  options => map(
    'start-snapshot-id', '<start_snapshot_id>',
    'end-snapshot-id', '<end_snapshot_id>'
  )
);

select id, new_name, _change_type, _change_ordinal, _commit_snapshot_id
from t_changelog_rename_changes
order by id;

The native changelog scan may return null in new_name for the pre-rename row (id = 1); the expected value is 'before'.
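For anyone scripting the repro, the "record start/end snapshot" steps can be filled in from the Iceberg `snapshots` metadata table. Below is a rough Python sketch that only builds the SQL strings; the helper names are hypothetical, and running them assumes a Spark session with the `local` catalog configured as in the statements above.

```python
def latest_snapshot_id_sql(table: str) -> str:
    """SQL that returns the most recent snapshot ID of an Iceberg table.

    Uses the Iceberg `snapshots` metadata table (columns `snapshot_id`
    and `committed_at`).
    """
    return (
        f"select snapshot_id from {table}.snapshots "
        "order by committed_at desc limit 1"
    )


def create_changelog_view_sql(table: str, view: str,
                              start_snapshot_id: int,
                              end_snapshot_id: int) -> str:
    """Build the create_changelog_view call from the repro with concrete IDs."""
    return (
        "CALL local.system.create_changelog_view(\n"
        f"  table => '{table}',\n"
        f"  changelog_view => '{view}',\n"
        "  options => map(\n"
        f"    'start-snapshot-id', '{start_snapshot_id}',\n"
        f"    'end-snapshot-id', '{end_snapshot_id}'\n"
        "  )\n"
        ")"
    )


# Intended use, assuming an existing SparkSession named `spark`:
#   start_id = spark.sql(latest_snapshot_id_sql("local.db.t_changelog_rename")).first()[0]
#   ... run the remaining inserts / alter table, then read end_id the same way ...
#   spark.sql(create_changelog_view_sql(
#       "db.t_changelog_rename", "t_changelog_rename_changes", start_id, end_id))
```

Run the first query right after the statement that precedes each "record ... snapshot" comment, so it picks up the snapshot the repro refers to.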

A similar issue exists for drop-then-add with the same column name:

create table local.db.t_changelog_drop_add (id int, value string)
using iceberg
tblproperties ('format-version' = '2');

insert into local.db.t_changelog_drop_add values (0, 'initial');
-- record start snapshot
insert into local.db.t_changelog_drop_add values (1, 'old');
alter table local.db.t_changelog_drop_add drop column value;
alter table local.db.t_changelog_drop_add add column value string;
insert into local.db.t_changelog_drop_add values (2, 'new');
-- record end snapshot, then create and query the changelog view as in the rename example above

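The old value column and the re-added value column have different Iceberg field IDs, so the data written to the old column (the row with id = 1, value 'old') should not be read as the new column; that row should return null for value.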

Expected behavior

Iceberg changelog scan should resolve data columns by Iceberg field-id, matching Spark/Iceberg results.

For renamed columns, old files should map to the renamed column by field-id.

For drop-then-add of the same name, old dropped column data should not be read as the newly added column.
