Skip to content

[python] Reject or drop stale index when updating globally indexed columns#8045

Merged
JingsongLi merged 14 commits into
apache:masterfrom
XiaoHongbo-Hope:global-index-update-action-option
May 31, 2026
Merged

[python] Reject or drop stale index when updating globally indexed columns#8045
JingsongLi merged 14 commits into
apache:masterfrom
XiaoHongbo-Hope:global-index-update-action-option

Conversation

@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor

Purpose

Updating a column covered by a global index via update_by_arrow_with_row_id
caused the index to become outdated — subsequent queries through the index
returned stale data instead of the updated values.

This PR adds the global-index.column-update-action option and enforces it
during single-machine column updates:

  • THROW_ERROR (default): raises when the update touches indexed columns.
  • DROP_PARTITION_INDEX: drops affected index entries from the manifest.

Tests

  • e2e: THROW_ERROR raises + DROP_PARTITION_INDEX succeeds with index removed.
  • Unit: index manifest write/read roundtrip and combine_deletes cases.

TableUpdateByRowId only rewrote the changed column file, leaving any
btree (global) index that covers the column stale. Reads via the index
returned wrong results: queries for the new value missed; queries for
the old value still pointed at a row whose data had changed.

Honor `global-index.column-update-action` after producing the update
messages — THROW_ERROR (default) refuses the update; DROP_PARTITION_INDEX
emits index-delete commit messages so the affected partition indexes are
dropped atomically with the data update.
Extend the btree update e2e to also drive the DROP_PARTITION_INDEX
action via table.copy({...}): the update succeeds, the new value
becomes readable, and the affected btree index entries are dropped
from the new snapshot's index manifest.

Also restore the INDEX_MANIFEST_ENTRY constant that was accidentally
removed in the precursor commit; it is unrelated to this PR.
- Rename `global_index_update.py` → `global_index_update_checker.py`
  to align with Java/Flink naming (`MergeIntoUpdateChecker`).
- Rename `IndexManifestFile.combine` → `combine_deletes` and
  `CommitMessage.index_files` → `index_deletes` to make the
  delete-only subset explicit.
- Align `write()` failure message with Java
  `ObjectsFile.writeWithoutRolling` ("Exception occurs when writing
  records to {path}. Clean up.").
- Drop docstrings that restated what the code does or referenced
  Java/Spark equivalents.
- Rename `IndexManifestFile.combine` → `combine_deletes` and
  `CommitMessage.index_files` → `index_deletes` to make the
  delete-only subset explicit.
- Align `write()` failure message with Java's
  `ObjectsFile.writeWithoutRolling`.
- Remove docstrings that restated what the code does or
  referenced Java/Spark counterparts.
…fest

- Remove hint text ("Set global-index.column-update-action = ...
  rebuild afterwards") from the THROW_ERROR RuntimeError to match
  Spark/Flink message format exactly.
- combine_deletes returns None when all entries are dropped instead
  of writing an empty manifest file.
…heck

- Remove GlobalIndexColumnUpdateAction docstring and ConfigOption
  .with_description() to match other enum options in the file.
- Replace `or` with explicit `is None` check for action resolution.
- Add test_combine_all_deleted_returns_none to lock the empty-
  survivors-return-None behavior.
Instead of checking in _update_by_arrow_with_row_id only, enforce the
global-index.column-update-action at commit time by inspecting
write_cols from committed files. This covers all update paths (update,
upsert, shard) in one place, matching Flink's MergeIntoUpdateChecker
architecture where the check sits before the committer.
Partial appends via TableWrite.with_write_type also set write_cols but
do not rewrite existing rows — they should not trigger the global index
action. Gate on first_row_id being set, which only happens for update
and shard-rewrite paths that actually modify existing data files.
Replace first_row_id guard with check_from_snapshot != -1 to tell
existing-row rewrites from ordinary partial appends. Row tracking
assigns first_row_id to append files during commit, which would
make the previous guard order-dependent.
…tion

Appending new rows with only the indexed column via
with_write_type(['k']) should succeed without THROW_ERROR and
without dropping the existing index manifest.
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 31, 2026 06:50
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 31, 2026 06:50
_test_index_manifest_inherited_after_write calls
update_by_arrow_with_row_id which uses pa.Table.sort_by
(PyArrow 7.0+, not available on Python 3.6).
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 31, 2026 07:16
XiaoHongbo-Hope added a commit to XiaoHongbo-Hope/paimon that referenced this pull request May 31, 2026
- Split data_evolution_merge_into.py into three modules:
  - data_evolution_merge_join.py (join construction + distributed write)
  - data_evolution_merge_transform.py (SET evaluation + transforms)
  - data_evolution_merge_into.py (entry, validation, scheduling)
- Return {num_matched, num_inserted, num_unchanged} per Jingsong review.
- Drop allow_multiple_matches; multi-match always errors.
- Silently skip blob columns during update (matches Spark).
- Defer global-index update action to a follow-up after PR apache#8045.
- Drop changes that belonged to PR apache#8045 (index_manifest_file.py,
  commit_message.py index_files field, file_store_commit.py
  index_deletes plumbing).
Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. The default behavior now rejects updates that touch globally indexed columns, and the opt-in DROP_PARTITION_INDEX path deletes only the affected partition/index entries. The commit-message/index-manifest plumbing and cleanup handling are covered by tests, and CI is green. Looks good to me.

@JingsongLi JingsongLi merged commit d50becf into apache:master May 31, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants