[fix](cloud)(restore) fix broken schema during restore of lsc=false tables#62708
[fix](cloud)(restore) fix broken schema during restore of lsc=false tables#62708LemonCL wants to merge 2 commits into
Conversation
…ables
Problem
-------
In cloud mode, when a snapshot of a table created with
`light_schema_change = false` is restored, the schema KV persisted under
`meta_schema_key({instance_id, index_id, schema_version})` ends up with
every column carrying `unique_id = -1`, because the schema is first
written by `create_tablet` and at that point the unique_ids are not yet
assigned. `commit_restore_job` then receives the correct schema (with
valid `unique_id >= 0`) in each rowset meta from the backup, but the
existing `put_schema_kv` is a no-op when the key already exists, so the
broken schema leaks through and subsequent reads fail with errors such
as `column reader is nullptr` or `different type between schema and
column reader`.
Fix
---
Introduce `put_schema_kv_on_restore()` in the cloud MetaService. It
reads the existing schema value, detects the broken-schema signature
(`column(0).unique_id() == -1` or an unparseable value), range-removes
all chunks of the stale schema, and writes the correct one. To avoid
replacing a bad schema with another bad one, it also refuses to write
a schema that is itself broken (empty columns or `column(0).unique_id
== -1`) and only logs a warning in that case.
In `MetaServiceImpl::commit_restore_job`, replace both existing
`put_schema_kv` call sites with `put_schema_kv_on_restore`, guarded by
an in-RPC `std::set<std::string>` so the same
`(index_id, schema_version)` pair does not issue redundant FDB
reads/writes when the restore spans many rowsets. Four counters
(`rs_meta_schema_put_cnt/skip_cnt`,
`tablet_meta_schema_put_cnt/skip_cnt`) are logged at the end of the
RPC to make the behaviour observable in production.
`put_versioned_schema_kv()` is intentionally NOT wrapped by the dedup
set: it targets a different key space (`versioned::meta_schema_key`)
and is already skip-if-exists inside the function.
Tests
-----
`cloud/test/meta_service_test.cpp` adds 5 unit tests covering every
branch of `put_schema_kv_on_restore`:
- `PutWhenKeyNotExist` — first-write path
- `NoopWhenExistingSchemaIsGood` — skip-if-healthy path
- `OverwriteWhenExistingIsBroken` — the fix itself
- `DefensiveSkipWhenIncomingHasEmptyColumns` — defensive guard
- `DefensiveSkipWhenIncomingHasUidNegativeOne` — defensive guard
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
Hi @plat1ko @meiyi @wyxxxcat @xy720 , could you please take a look It addresses a correctness bug on the cloud restore path: when a Thanks! |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: N/A
Problem Summary:
In cloud mode, when restoring a snapshot of a table that was created with
light_schema_change = false, the schema KV undermeta_schema_key({instance_id, index_id, schema_version})ends up with everycolumn carrying
unique_id = -1, because the schema is first written bycreate_tablet(where unique_ids are not yet materialized).commit_restore_jobthen receives the correct schema from each rowset meta in the backup, but
the existing
put_schema_kvis a no-op when the key already exists, so thebroken schema leaks through and subsequent reads fail with errors such as
column reader is nullptrordifferent type between schema and column reader.Upstream has a related DCHECK in
put_schema_kvthat compares new-vs-savedschemas (introduced by #55247), and an explicit comment in
MetaServiceImpl::update_tabletwarning thatput_schema_kvskips writingwhen the key already exists, but the restore code path was never patched. PR
#57074 fixed a related index_id issue in the same restore path but did not
touch column
unique_id. PR #50657 only prevents new cloud tables fromsetting
light_schema_change=false, so it does not help legacy snapshotsbeing restored.
Root cause
create_tablet(called during restore) writes aTabletSchemaCloudPBwithall
unique_id = -1into the schema KV. This is expected behaviour forlight_schema_change=falseat tablet-creation time.commit_restore_jobreceives the real schema (withunique_id >= 0) ineach rowset meta, and tries to persist it via
put_schema_kv.put_schema_kvsees the key already exists and silently returns withoutwriting, so the broken schema remains on disk forever.
Fix
put_schema_kv_on_restore()incloud/src/meta-service/meta_service_schema.cpp:column(0).unique_id() == -1,(
[schema_key, schema_key + encode_int64(INT64_MAX))) and writes thecorrect one.
(
column_size() == 0orcolumn(0).unique_id() == -1), skip the writeand log a WARNING. This guarantees we never replace a bad schema with
another bad one, even if a future caller passes a malformed schema.
MetaServiceImpl::commit_restore_jobincloud/src/meta-service/meta_service.cpp:put_schema_kv()call sites withput_schema_kv_on_restore(),std::set<std::string> restored_schema_keysso the same
(index_id, schema_version)pair does not issue redundantFDB reads/writes when the restore spans many rowsets,
the RPC (
rs_meta_schema_put_cnt,rs_meta_schema_skip_cnt,tablet_meta_schema_put_cnt,tablet_meta_schema_skip_cnt).put_versioned_schema_kv()is intentionally NOT wrapped by the dedup setnor rerouted; it targets a separate key space
(
versioned::meta_schema_key), already skip-if-exists internally, and isout of scope for this fix.
Tests
cloud/test/meta_service_test.cppcovering every branch ofput_schema_kv_on_restore:PutSchemaKvOnRestoreTest.PutWhenKeyNotExist— first-write path.PutSchemaKvOnRestoreTest.NoopWhenExistingSchemaIsGood— skip-if-healthypath.
PutSchemaKvOnRestoreTest.OverwriteWhenExistingIsBroken— the actualbug-fix path: seeds a broken schema with
unique_id=-1, calls the newfunction, and asserts the resulting KV holds the good schema.
PutSchemaKvOnRestoreTest.DefensiveSkipWhenIncomingHasEmptyColumns—verifies the defensive guard refuses an empty incoming schema.
PutSchemaKvOnRestoreTest.DefensiveSkipWhenIncomingHasUidNegativeOne—verifies the defensive guard refuses an incoming schema whose first
column still has
unique_id == -1.Run just these tests with:
Release note
Fix a correctness bug in cloud mode where restoring a snapshot of a table
created with
light_schema_change = falsewould leave the schema KV withevery column having
unique_id = -1, causing subsequent queries on therestored table to fail with
column reader is nullptror similar errors.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)