-
Notifications
You must be signed in to change notification settings - Fork 17
Open
Description
SH test found a read_verify failed with "Invalid shard_id".
The reason is that the blob_id→location index was stale: during crash recovery, GC moved the blob’s data to a new chunk before log replay completed, and the subsequent replay updated the index back to the old location. Reads then targeted the wrong block, finding a header with a different shard_id.
For example:
- T1: shard=1, blob=100 committed at LSN 100.
- T2: SM crashed with cp_lsn=90 and dc_lsn=110.
- T3: On restart, GC manager started and run before replication svc start completed.
- T4: GC moved the blob’s chunk (location changed, e.g., from chunk=40 to chunk=95).
- T5: Log replay applied LSN 100 and reset the blob’s index to the old location (chunk=40), resulting in reads targeting stale blocks that no longer contain the correct shard’s data.
Relevant Logs:
- Blob get failure (invalid shard_id):
[02/05/26 23:51:59.708] [storage_mgr] [debug] [402416] [hs_blob_manager.cpp:322:_get_blob] [traceID=8132044346085171088,shardID=0x48,pg=0,shard=0x48,blob=3099880] Blob Get request: pg=0, group=b9af911a-6c1b-4ff4-94fa-c10aa39f0d0e, shard=0x48, blob=3099880, offset=0, len=0
[02/05/26 23:51:59.708] [storage_mgr] [debug] [402416] [hs_blob_manager.cpp:360:_get_blob_data] [traceID=8132044346085171088,shardID=0x48,pg=0,shard=0x48,blob=3099880] Reading from blkid=[{blk#=287643 count=2 chunk=40},] to buf=0x582e379e2000
[02/05/26 23:51:59.718] [storage_mgr] [error] [18] [hs_blob_manager.cpp:708:do_verify_blob] Invalid shard_id in header: [header=magic=0x21fdffdba8d68fc6 version=2 shard=0x57 blob_size=4096 user_size=32 algo=1 hash=65 2a 49 ae ..., user_key=storage_hammer:nuobject2-sh-1007]
- Blob put and commit to chunk 40:
[02/02/26 09:18:47.222] [storage_mgr] [debug] [98] [hs_blob_manager.cpp:118:_put_blob] [traceID=15106702424960586969,shardID=0x48,pg=0,shard=0x48,blob=3099880] Blob Put request: pg=0, group=b9af911a-6c1b-4ff4-94fa-c10aa39f0d0e, shard=0x48, length=4096
[02/02/26 09:18:47.222] [storage_mgr] [trace] [98] [hs_blob_manager.cpp:193:_put_blob] ... header=... shard=0x48 ...
[02/02/26 09:18:47.222] [storage_mgr] [debug] [98] [hs_blob_manager.cpp:497:blob_put_get_blk_alloc_hints] ... Picked p_chunk_id=40 ...
[02/02/26 09:18:49.269] [storage_mgr] [trace] [79] [hs_blob_manager.cpp:257:on_blob_put_commit] blob put commit lsn=4920918, pbas=[{blk#=287643 count=2 chunk=40},]
[02/02/26 09:18:49.269] [storage_mgr] [debug] [79] ... Put blob success blkid=[{blk#=287643 count=2 chunk=40},]
-
Checkpoint flush during put:
[02/02/26 09:18:48.923] [storage_mgr] [debug] [21] [raft_repl_dev.cpp:2456:cp_flush] ... cp flush ... lsn=4920910, clsn=3910000 ... -
Shard sealed:
[02/02/26 09:18:56.311] [storage_mgr] [debug] [99] [hs_shard_manager.cpp:256:_seal_shard] ... Seal shard request: is_open=true
[02/02/26 09:18:56.313] [storage_mgr] [debug] [79] ... Shard sealed success, is_open=false
- Restart order: GC starts before log replay completes:
[02/02/26 09:21:44.725] [storage_mgr] [info] [10] [raft_repl_dev.cpp:88:RaftReplDev] ... commited_lsn=4923664, checkpoint_lsn:4920910 ...
[02/02/26 09:21:44.728] [storage_mgr] [info] [10] [hs_homeobject.cpp:358:operator()] Starting GC manager
[02/02/26 09:21:44.728] [storage_mgr] [info] [10] [raft_repl_service.cpp:171:start] Starting LogStore service, fist_boot = false
- GC moves data from chunk 40 to 95:
[02/02/26 09:22:44.728] [storage_mgr] [debug] [69] [gc_manager.cpp:1206:process_gc_task] ... start process gc task for move_from_chunk=40 ...
[02/02/26 09:24:50.015] [storage_mgr] [info] [69] [gc_manager.cpp:1307:process_gc_task] ... task for move_from_chunk=40 to move_to_chunk=95 ... completed!
- Log replay re-applies old location (chunk=40):
[02/02/26 09:28:04.506] [storage_mgr] [trace] [10] [raft_repl_dev.cpp:2583:on_log_found] ... Applying ... lsn=4920918 ...
[02/02/26 09:28:04.506] [storage_mgr] [debug] [10] [raft_repl_dev.cpp:2638:on_log_found] Replay log on restart, ... local_blkid=[[{blk#=287643 count=2 chunk=40},] ]
[02/02/26 09:28:04.506] [storage_mgr] [debug] [10] [raft_repl_dev.cpp:1631:handle_commit] ... Commit ... local_blkid=[[{blk#=287643 count=2 chunk=40},] ]
[02/02/26 09:28:04.506] [storage_mgr] [trace] [10] [hs_blob_manager.cpp:222:local_add_blob_info] ... blob put commit ... pbas=[{blk#=287643 count=2 chunk=40},]
...
[02/02/26 09:28:04.834] [storage_mgr] [info] [10] [raft_repl_service.cpp:173:start] Started LogStore service, log replay should already done till this point
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels