[python] Reject or drop stale index when updating globally indexed columns#8045
Merged
JingsongLi merged 14 commits intoMay 31, 2026
Merged
Conversation
TableUpdateByRowId only rewrote the changed column file, leaving any btree (global) index that covers the column stale. Reads via the index returned wrong results: queries for the new value missed; queries for the old value still pointed at a row whose data had changed. Honor `global-index.column-update-action` after producing the update messages — THROW_ERROR (default) refuses the update; DROP_PARTITION_INDEX emits index-delete commit messages so the affected partition indexes are dropped atomically with the data update.
Extend the btree update e2e to also drive the DROP_PARTITION_INDEX
action via table.copy({...}): the update succeeds, the new value
becomes readable, and the affected btree index entries are dropped
from the new snapshot's index manifest.
Also restore the INDEX_MANIFEST_ENTRY constant that was accidentally
removed in the precursor commit; it is unrelated to this PR.
- Rename `global_index_update.py` → `global_index_update_checker.py`
to align with Java/Flink naming (`MergeIntoUpdateChecker`).
- Rename `IndexManifestFile.combine` → `combine_deletes` and
`CommitMessage.index_files` → `index_deletes` to make the
delete-only subset explicit.
- Align `write()` failure message with Java
`ObjectsFile.writeWithoutRolling` ("Exception occurs when writing
records to {path}. Clean up.").
- Drop docstrings that restated what the code does or referenced
Java/Spark equivalents.
- Rename `IndexManifestFile.combine` → `combine_deletes` and `CommitMessage.index_files` → `index_deletes` to make the delete-only subset explicit. - Align `write()` failure message with Java's `ObjectsFile.writeWithoutRolling`. - Remove docstrings that restated what the code does or referenced Java/Spark counterparts.
…fest
- Remove hint text ("Set global-index.column-update-action = ...
rebuild afterwards") from the THROW_ERROR RuntimeError to match
Spark/Flink message format exactly.
- combine_deletes returns None when all entries are dropped instead
of writing an empty manifest file.
…heck - Remove GlobalIndexColumnUpdateAction docstring and ConfigOption .with_description() to match other enum options in the file. - Replace `or` with explicit `is None` check for action resolution. - Add test_combine_all_deleted_returns_none to lock the empty- survivors-return-None behavior.
Instead of checking in _update_by_arrow_with_row_id only, enforce the global-index.column-update-action at commit time by inspecting write_cols from committed files. This covers all update paths (update, upsert, shard) in one place, matching Flink's MergeIntoUpdateChecker architecture where the check sits before the committer.
Partial appends via TableWrite.with_write_type also set write_cols but do not rewrite existing rows — they should not trigger the global index action. Gate on first_row_id being set, which only happens for update and shard-rewrite paths that actually modify existing data files.
Replace first_row_id guard with check_from_snapshot != -1 to tell existing-row rewrites from ordinary partial appends. Row tracking assigns first_row_id to append files during commit, which would make the previous guard order-dependent.
…tion Appending new rows with only the indexed column via with_write_type(['k']) should succeed without THROW_ERROR and without dropping the existing index manifest.
_test_index_manifest_inherited_after_write calls update_by_arrow_with_row_id which uses pa.Table.sort_by (PyArrow 7.0+, not available on Python 3.6).
XiaoHongbo-Hope
added a commit
to XiaoHongbo-Hope/paimon
that referenced
this pull request
May 31, 2026
- Split data_evolution_merge_into.py into three modules:
- data_evolution_merge_join.py (join construction + distributed write)
- data_evolution_merge_transform.py (SET evaluation + transforms)
- data_evolution_merge_into.py (entry, validation, scheduling)
- Return {num_matched, num_inserted, num_unchanged} per Jingsong review.
- Drop allow_multiple_matches; multi-match always errors.
- Silently skip blob columns during update (matches Spark).
- Defer global-index update action to a follow-up after PR apache#8045.
- Drop changes that belonged to PR apache#8045 (index_manifest_file.py,
commit_message.py index_files field, file_store_commit.py
index_deletes plumbing).
Merged
leaves12138
approved these changes
May 31, 2026
Contributor
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the fix. The default behavior now rejects updates that touch globally indexed columns, and the opt-in DROP_PARTITION_INDEX path deletes only the affected partition/index entries. The commit-message/index-manifest plumbing and cleanup handling are covered by tests, and CI is green. Looks good to me.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Updating a column covered by a global index via
update_by_arrow_with_row_idcaused the index to become outdated — subsequent queries through the index
returned stale data instead of the updated values.
This PR adds the
global-index.column-update-actionoption and enforces itduring single-machine column updates:
Tests