[Fix](cloud-mow) Fix correctness problem when there exists other interleaved txn between a txn's retries #50417
1238a6c to 1a8f195
run buildall
TPC-H: Total hot run time: 33602 ms
TPC-DS: Total hot run time: 193276 ms
ClickBench: Total hot run time: 29.25 s
BE UT Coverage Report: Increment line coverage, Increment coverage report
BE Regression P0 && UT Coverage Report: Increment line coverage, Increment coverage report
38f8dfb to f019ba0
run buildall
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TPC-H: Total hot run time: 33957 ms
TPC-DS: Total hot run time: 185466 ms
41f4215 to f34231c
run buildall
TPC-H: Total hot run time: 33980 ms
TPC-DS: Total hot run time: 185641 ms
ClickBench: Total hot run time: 28.59 s
BE UT Coverage Report: Increment line coverage, Increment coverage report
BE Regression P0 && UT Coverage Report: Increment line coverage, Increment coverage report
run feut
LGTM
PR approved by at least one committer and no changes requested.
LGTM
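…rleaved txn between a txn's retries (apache#50417)

The problematic scenario:

1. txn=X acquires the lock for the first time, tries to commit at ver=V1, and sends calculation task1 to be1, covering tablet=A and tablet=B.
2. txn=X finishes the calculation on tablet=A, writes the ver=V1 delete bitmap to ms, and records it in tablet=A's pending delete bitmap KV.
3. txn=X voluntarily releases the lock because task1 timed out.
4. txn=Y acquires the lock for the first time, tries to commit at ver=V1, and sends calculation task2 to be1, covering tablet=A.
5. txn=Y finishes the calculation on tablet=A, uses tablet=A's pending delete bitmap KV to delete the delete bitmap written by txn=X, and writes its own ver=V1 delete bitmap.
6. txn=Y voluntarily releases the lock because the calculation timed out on some tablets.
7. txn=X acquires the lock a second time, still tries to commit at ver=V1, and sends calculation task3 to be1, covering tablet=A and tablet=B.
8. At this point task1 on be1 has not finished yet, so task3 fails to register on be1.
9. task1 finishes the calculation on tablet=B and reports success to fe. txn=X commits successfully at ver=V1, but its ver=V1 delete bitmap on tablet=A has already been deleted.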
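The nine steps above can be replayed in a toy model to see where the delete bitmap goes missing. This is only a sketch under stated assumptions: `MetaService`, `update_delete_bitmap`, `run_scenario`, and the `retry_task_accepted` flag are hypothetical names invented for illustration and are not Doris APIs. The flag models one possible remedy, letting the retried task run on be1 instead of being rejected; the actual change in this PR may differ.

```python
from dataclasses import dataclass, field


@dataclass
class MetaService:
    """Toy stand-in for ms; all names here are hypothetical."""
    # Keys are (txn, tablet, version); presence means the bitmap exists.
    delete_bitmaps: set = field(default_factory=set)
    # tablet -> keys written by the most recent not-yet-committed attempt.
    pending: dict = field(default_factory=dict)

    def update_delete_bitmap(self, txn, tablet, version):
        # First drop whatever the previous uncommitted attempt left behind,
        # as recorded in the tablet's pending delete bitmap KV.
        for key in self.pending.get(tablet, []):
            self.delete_bitmaps.discard(key)
        key = (txn, tablet, version)
        self.delete_bitmaps.add(key)
        self.pending[tablet] = [key]


def run_scenario(retry_task_accepted: bool):
    """Replay steps 1-9 and return the tablets where txn X lost its bitmap."""
    ms = MetaService()
    v1 = 1
    ms.update_delete_bitmap("X", "A", v1)      # step 2: task1 finishes tablet A
    # step 3: X times out and releases the lock
    ms.update_delete_bitmap("Y", "A", v1)      # step 5: Y removes X's bitmap on A
    # step 6: Y times out and releases the lock
    if retry_task_accepted:
        # Assumed remedy: task3 is allowed to run and recomputes tablet A.
        ms.update_delete_bitmap("X", "A", v1)
    # steps 8-9: task1 finishes tablet B and X commits at v1
    ms.update_delete_bitmap("X", "B", v1)
    return [t for t in ("A", "B") if ("X", t, v1) not in ms.delete_bitmaps]
```

Calling `run_scenario(False)` reproduces the bug (tablet A is reported as missing), while `run_scenario(True)` shows that recomputing tablet A during the retry restores the bitmap in this simplified model.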
…et_id) being executed concurrently (#50847)

### What problem does this PR solve?

After #50417, there may be multiple calc delete bitmap tasks with different signatures for the same (txn_id, tablet_id) load on the same BE. We use `_rowset_update_lock` to keep them from executing concurrently and so avoid correctness problems, e.g. a mismatch between rowset meta and segment data objects caused by concurrent writes to the same rowset through the transient rowset writer in the partial update publish phase:

```
W20250513 15:50:55.371588 1049 file_reader.cpp:36] [NOT_FOUND]failed to read from : code=NOT_FOUND, type=16, request_id=failed to read
W20250513 15:50:55.371667 1049 beta_rowset.cpp:202] failed to open segment. data/1747122561886/020000000000000125473fbacc484a4f8c46478ab6f64b90_2.dat under rowset 020000000000000125473fbacc484a4f8c46478ab6f64b90 : [NOT_FOUND]failed to read from : code=NOT_FOUND, type=16, request_id=failed to read
```
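The serialization idea can be sketched as follows. This is an illustrative model, not the Doris implementation: the per-key lock table, `_lock_for`, and `run_tasks` are hypothetical, and the real `_rowset_update_lock` may be scoped differently. The sketch only shows that tasks with different signatures targeting the same (txn_id, tablet_id) never overlap once they share a lock.

```python
import threading
import time
from collections import defaultdict

# Hypothetical lock table: one lock per (txn_id, tablet_id).
_locks_guard = threading.Lock()
_rowset_locks = defaultdict(threading.Lock)


def _lock_for(txn_id, tablet_id):
    # Guard the table itself so two threads cannot create two locks for one key.
    with _locks_guard:
        return _rowset_locks[(txn_id, tablet_id)]


def run_tasks(signatures, txn_id, tablet_id):
    """Run one task per signature and return the peak number running at once."""
    active = 0
    max_active = 0
    counter_guard = threading.Lock()

    def task(signature):
        nonlocal active, max_active
        # All tasks for the same (txn_id, tablet_id) queue up on the same lock.
        with _lock_for(txn_id, tablet_id):
            with counter_guard:
                active += 1
                max_active = max(max_active, active)
            time.sleep(0.01)  # stand-in for rewriting the rowset
            with counter_guard:
                active -= 1

    threads = [threading.Thread(target=task, args=(s,)) for s in signatures]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_active
```

With three signatures on the same key, for example `run_tasks([101, 102, 103], txn_id=1, tablet_id=10)`, the peak concurrency stays at one, so in this model no two writers touch the same rowset at the same time.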
What problem does this PR solve?
The problematic scenario:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)