Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: merge predicate for concurrent writes #2291

Merged
merged 7 commits into from
Mar 19, 2024

Conversation

JonasDev1
Copy link
Contributor

@JonasDev1 JonasDev1 commented Mar 15, 2024

Description

This merge request will change the commitInfo.operationParameters.predicate of the merge operation. This is required to allow conflict checking on concurrent commits. Before, the predicate was just the simple merge predicate like source.eventId = target.eventId, but the conflict checker doesn't know these aliases and doesn't have access to the source df.
So I now use the early_filter, which is also used before to filter a subset of the target table. Basically, the early_filter only contains static filters and partition filters that are converted to fixed values based on the source data. The early_filter can be None if no column/partition pre-filtering is possible, or if the merge contains a not_match_source operation. (See the generalize_filter function in the file).
The commitInfo predicate uses exactly this filter, except that the target alias is removed.

The predicate is used by the conflict checker, for example when there are multiple concurrent merges. If there is a predicate, the conflict checker will check whether the concurrent commit wrote or deleted data within that predicate. If the predicate is None, the conflict checker will treat the commit as a `read_whole_table' and interpret any concurrent updates as a conflict.

Example:
Target table with partition country
Merge with predicate source.id = target.id AND target.country='DE' -> CommitInfo predicate country='DE'
Merge with predicate source.id = target.id AND target.country=source.country -> CommitInfo predicate country='DE' OR country='US'
Merge with predicate source.id = target.id -> CommitInfo predicate None (As full target table join is required)

Related Issue(s)

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Mar 15, 2024
@ion-elgreco
Copy link
Collaborator

@Blajda :)

@JonasDev1
Copy link
Contributor Author

JonasDev1 commented Mar 16, 2024

I have found another bug: After the successful merge, the merge operation will return a DeltaTable, for this the new commit will be merged in the old snapshot, but the concurrent commit is not included in it, which results in a version mismatch error. I'll try to recreate this in an integration test.

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Mar 16, 2024

@JonasDev1 yup it's a known issue and there is a PR open for it in the write operation: #2280, but we probably need to address the commit advancement across the board

@JonasDev1
Copy link
Contributor Author

I have added some merge concurrency tests where the version mismatch error also occurs

Copy link
Collaborator

@Blajda Blajda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
The Delta protocol doesn't specify anything regarding conflicting resolution so I think this is okay.
Some of the details might change if we need to inter-op with other implementations.

@ion-elgreco ion-elgreco merged commit 42e24dd into delta-io:main Mar 19, 2024
20 checks passed
Blajda added a commit that referenced this pull request Mar 23, 2024
Blajda added a commit that referenced this pull request Mar 23, 2024
reverts #2291

Introduces a regression to merge that occurred before
#2158

Causes the added test case to fail
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merging to a table with multiple distinct partitions in parallel fails _internal.DeltaError when merging
3 participants