branch-2.1: [Bug] Fix accidental table deletion during restore job #48820 #49498

Merged
yiguolei merged 1 commit into branch-2.1 from auto-pick-48820-branch-2.1
Mar 29, 2025
@github-actions
Contributor

Cherry-picked from #48820

@github-actions github-actions bot requested a review from yiguolei as a code owner March 26, 2025 03:20
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Mar 26, 2025
@hello-stephen
Contributor

run buildall

### What problem does this PR solve?
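When a CCR database-sync job is configured and the target database already contains a table with the same name as a source table, the table is deleted by mistake and the table metadata on the master and follower FEs becomes inconsistent.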

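#### How a CCR job handles a table that already exists in the target database

When the FE finds that a table to be restored already exists, it checks whether the new table and the existing table agree on schema and related information. If they do not, it throws an exception (`Table {} already exists but with different schema, local table: {}, remote table: {}`) and the Restore job fails. The ccr-syncer service catches this exception, renames the table with an alias (`__ccr_tablename_timestamp`), and sends a new Restore request to the FE. If that Restore succeeds, the syncer runs replace table (swap=false) to replace the table and complete the sync.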

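#### Current FE handling logic

A for loop checks each table to be restored. If the schema of an existing table differs from that of the table being synced, the FE returns failure immediately and cancels the Restore job. When several tables conflict, each Restore reports only one of them, so the syncer keeps issuing Restore requests until every conflicting table has been given an alias.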
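The retry behavior described here can be illustrated with a small model. This is a minimal sketch, not the Doris or ccr-syncer code: `firstConflict`, `restoreAttemptsUntilSuccess`, and the alias format (an attempt counter in place of the timestamp) are all hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class RestoreRetrySketch {
    // Hypothetical stand-in for the FE check: report only the FIRST
    // non-aliased table whose schema conflicts, then stop.
    static Optional<String> firstConflict(List<String> backupTables,
                                          Map<String, String> aliases,
                                          List<String> conflicting) {
        for (String name : backupTables) {
            if (!aliases.containsKey(name) && conflicting.contains(name)) {
                return Optional.of(name); // job is cancelled; later tables are not checked
            }
        }
        return Optional.empty();
    }

    // Hypothetical stand-in for the syncer: alias the reported table and retry.
    static int restoreAttemptsUntilSuccess(List<String> backupTables,
                                           List<String> conflicting) {
        Map<String, String> aliases = new LinkedHashMap<>();
        int attempts = 0;
        while (true) {
            attempts++;
            Optional<String> conflict = firstConflict(backupTables, aliases, conflicting);
            if (!conflict.isPresent()) {
                return attempts; // no conflict left: this Restore succeeds
            }
            // Assumed alias shape; the real alias carries a timestamp.
            aliases.put(conflict.get(), "__ccr_" + conflict.get() + "_" + attempts);
        }
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList("t1", "t2", "t3");
        // Three conflicting tables need three failed attempts plus one successful one.
        System.out.println(restoreAttemptsUntilSuccess(tables, tables)); // prints 4
    }
}
```

In this model, N conflicting tables cost N failed Restore jobs before one can succeed, and each of those failures runs the cancel cleanup.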

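#### The problem in the FE handling logic

Because the table is restored under its alias, the FE takes the table-does-not-exist path: it builds the table object from the backup's schema and only at the end updates the table name to the alias. The crux is that adding tables to `restoredTables` and checking schema consistency happen in the same loop. The first table is handled normally as an aliased table and is added to `restoredTables`, but if the second table's schema does not match, the loop returns the exception immediately, so the first table's name is never set to the alias. In effect, the source database's table name has been added to `restoredTables`. When the restore job then fails, the cancel cleanup logic deletes the alias tables recorded in `restoredTables`, but the name at that point is not the alias. It is the real table name, so the real table is deleted.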
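The sequence above can be reproduced with a toy model. This is a minimal sketch under stated assumptions, not the FE code: the `Table` class, the map-based database, and the table names and schema strings are hypothetical. Only the ordering (add to `restoredTables`, early return, drop-by-name cleanup) follows the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RestoreCancelSketch {
    static final class Table {
        String name;
        final String schema;
        Table(String name, String schema) { this.name = name; this.schema = schema; }
    }

    // Returns the names of the tables left in the target database
    // after one failed restore followed by the cancel cleanup.
    static List<String> survivingTables() {
        // Target database: two hand-created tables whose schemas differ from the source.
        Map<String, Table> db = new LinkedHashMap<>();
        db.put("orders", new Table("orders", "local-v1"));
        db.put("users", new Table("users", "local-v1"));

        List<Table> backup = Arrays.asList(
                new Table("orders", "remote-v2"), new Table("users", "remote-v2"));

        // The syncer has aliased only the first table so far.
        Map<String, String> aliases = new HashMap<>();
        aliases.put("orders", "__ccr_orders_1");

        List<Table> restoredTables = new ArrayList<>();
        boolean failed = false;
        for (Table remote : backup) {
            String target = aliases.getOrDefault(remote.name, remote.name);
            Table local = db.get(target);
            if (local == null) {
                // "Table does not exist" path: built from the backup schema,
                // still carrying the ORIGINAL name at this point.
                restoredTables.add(new Table(remote.name, remote.schema));
            } else if (!local.schema.equals(remote.schema)) {
                failed = true; // early return: the rename step below never runs
                break;
            }
        }
        if (!failed) {
            for (Table t : restoredTables) {
                t.name = aliases.getOrDefault(t.name, t.name);
            }
        } else {
            // Cancel cleanup drops by name, and the name is still "orders".
            for (Table t : restoredTables) {
                db.remove(t.name);
            }
        }
        return new ArrayList<>(db.keySet());
    }

    public static void main(String[] args) {
        System.out.println(survivingTables()); // prints [users]
    }
}
```

The hand-created `orders` table disappears even though the restore never created anything under that name.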

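After repeated Restore attempts, the syncer has aliased every table and the restore job can finally succeed. However, when the syncer then runs replace table for each table, the original table no longer exists on the master, so the operation fails and can never recover.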

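#### Why is table metadata inconsistent between the FE master and followers?

While processing a restore job, the master writes the restore Job object to BDB only in the download, commit, finished, and cancel states. When the first table throws the exception, the state is pending, so nothing is synced to the followers. After the repeated restores eventually succeed, the table name is the alias, so the followers never replay a drop table operation and keep the metadata of the originally hand-created table forever.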
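The divergence can be pictured with a deliberately simplified model. This is a sketch under assumptions, not the FE replay mechanism: the `State` enum lists only the states named above, and `isPersisted` and the table sets are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RestoreEditLogSketch {
    // Only the states named in the description; the real job may have more.
    enum State { PENDING, DOWNLOAD, COMMIT, FINISHED, CANCEL }

    // Per the description, a job in the pending state is not written to BDB.
    static boolean isPersisted(State s) {
        return s != State.PENDING;
    }

    // Returns [masterTables, followerTables] after a restore that fails while pending.
    static List<Set<String>> tablesAfterFailedRestore() {
        Set<String> master = new TreeSet<>(Arrays.asList("orders", "users"));
        Set<String> follower = new TreeSet<>(master);

        State stateAtFailure = State.PENDING;
        master.remove("orders"); // the master's cancel cleanup drops the table locally
        if (isPersisted(stateAtFailure)) {
            follower.remove("orders"); // simplification: followers only see persisted jobs
        }
        return Arrays.asList(master, follower);
    }

    public static void main(String[] args) {
        List<Set<String>> result = tablesAfterFailedRestore();
        System.out.println("master:   " + result.get(0)); // master:   [users]
        System.out.println("follower: " + result.get(1)); // follower: [orders, users]
    }
}
```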


Co-authored-by: wubiao02 <wubiao02@meituan.com>
@yiguolei yiguolei force-pushed the auto-pick-48820-branch-2.1 branch from 8978ec0 to c335ac1 Compare March 29, 2025 03:22
@yiguolei
Contributor

run buildall

@yiguolei yiguolei merged commit 8206490 into branch-2.1 Mar 29, 2025
21 of 24 checks passed
@github-actions github-actions bot deleted the auto-pick-48820-branch-2.1 branch March 29, 2025 12:26