Conversation
pending on eBay/HomeStore#605
force-pushed from 1b2870d to 60dd987
xiaoxichen left a comment:
I think the crash recovery scenarios need more thought and explanation.
chunk->m_pg_id = std::nullopt;
chunk->m_v_chunk_id = std::nullopt;
...
pdev_heap->m_heap.emplace(chunk);
Are we fine to open these chunks for re-use by another PG at this point?
Maybe fine; another PG can pick these chunks up, but its commit has to wait, since commits are sequential.
That's a problem. At this point another create-PG request may come in and reuse chunks that still belong to the destroyed PG. If a crash then happens at an unfortunate moment and we recover HomeObject, two PGs can end up using some of the same chunks.
Fixed it by resetting the chunks before destroying the PG super block, and only returning the chunks to the heap once the PG super block has been destroyed.
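To make the crash window concrete, here is a comment-only sketch of the two orderings (helper names are hypothetical; reset_pg_chunks appears later in this diff):

```cpp
// Unsafe ordering (before the fix): chunks are returned to the heap while
// the PG superblk still exists, so recovery can resurrect the destroyed PG
// while a newly created PG already owns some of its chunks.
//   return_pg_chunks_to_heap(pg);  // another create-PG may grab them here
//   /* crash */                    // recovery still replays pg's superblk
//   destroy_pg_superblk(pg);
//
// Safe ordering (after the fix):
//   reset_pg_chunks(pg);           // clear chunk ownership and CP-flush it
//   destroy_pg_superblk(pg);       // recovery can no longer see this PG
//   return_pg_chunks_to_heap(pg);  // only now are chunks free for re-use
```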
// destroy index table
auto uuid_str = boost::uuids::to_string(hs_pg->index_table_->uuid());
index_table_pg_map_.erase(uuid_str);
hs()->index_service().remove_index_table(hs_pg->index_table_);
@koujl @shosseinimotlagh could you confirm that remove_index_table works like rm -rf, i.e. that it properly removes all index entries?
remove_index_table is an in-memory operation:
https://github.com/eBay/HomeStore/blob/69ea506d3c0cc07f16385ee5d0a9c8f28f64416a/src/lib/index/index_service.cpp#L108
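For reference, the linked implementation is, roughly, just an erase from an in-memory map — a paraphrased sketch (member names are assumptions, not the exact HomeStore source):

```cpp
// Paraphrased sketch of IndexService::remove_index_table: it drops the
// table from an in-memory registry only; no on-disk index entries are
// touched here. Member names are assumptions.
void IndexService::remove_index_table(const std::shared_ptr< IndexTableBase >& tbl) {
    std::unique_lock lg{m_index_map_mtx};
    m_index_map.erase(tbl->uuid());
}
```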
Here is a table of the crash recovery scenarios.
When a member is moved out of a PG, clean up the PG and reclaim resources (see the sketch below):
1. Mark the PG state as destroyed.
2. Destroy the PG shards and the shard super blocks.
3. Reset all chunks.
4. Remove the PG index table.
5. Destroy the PG super block.
6. Return the PG chunks to dev_heap.
Additionally, enhance restart test scenarios.
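A minimal sketch of that sequence, assuming hypothetical helper names (only reset_pg_chunks appears in this diff):

```cpp
// Destroy-PG sequence; the ordering matters for crash consistency, as the
// review discussion below explains. All helpers except reset_pg_chunks are
// hypothetical.
void destroy_pg(pg_id_t pg_id) {
    mark_pg_destroyed(pg_id);             // 1. persist the destroyed state
    destroy_pg_shards(pg_id);             // 2. shards + shard super blocks
    reset_pg_chunks(pg_id);               // 3. clear chunk ownership + CP flush
    remove_pg_index_table(pg_id);         // 4. drop the index table
    destroy_pg_superblk(pg_id);           // 5. recovery no longer sees the PG
    return_pg_chunks_to_dev_heap(pg_id);  // 6. chunks are now safe to re-use
}
```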
Codecov Report
Attention: Patch coverage is …
@@ Coverage Diff @@
## main #242 +/- ##
==========================================
- Coverage 63.15% 62.79% -0.36%
==========================================
Files 32 32
Lines 1900 2086 +186
Branches 204 226 +22
==========================================
+ Hits 1200 1310 +110
- Misses 600 665 +65
- Partials 100 111 +11
// destroy pg super blk
hs_pg->pg_sb_.destroy();
...
// return pg chunks to dev heap
Could you also explain: if a crash happened here, what would be the process for recovering these chunks to the heap?
Is it that, because we no longer have the PG superblock, these chunks are available by default during recovery?
Yes, you are right. If a crash happens here and we then recover, we no longer have the PG superblk, so these chunks belong to the dev heap by default.
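A minimal sketch of that recovery rule (all names are assumptions except the chunk fields and heap emplace quoted earlier in this review):

```cpp
// Recovery sketch: a chunk is rebound to a PG only if a surviving PG
// superblk claims it; otherwise it defaults back to its pdev heap. A PG
// whose superblk was destroyed before the crash therefore donates its
// chunks to the heap automatically.
void recover_chunk(Chunk* chunk) {
    if (auto pg_id = find_claiming_pg_superblk(chunk)) {
        bind_chunk_to_pg(chunk, *pg_id);       // PG superblk survived
    } else {
        chunk->m_pg_id = std::nullopt;         // no claimant: free chunk
        chunk->m_v_chunk_id = std::nullopt;
        pdev_heap->m_heap.emplace(chunk);
    }
}
```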
void HSHomeObject::reset_pg_chunks(pg_id_t pg_id) {
    bool res = chunk_selector_->reset_pg_chunks(pg_id);
    RELEASE_ASSERT(res, "Failed to reset all chunks in pg {}", pg_id);
    auto fut = homestore::hs()->cp_mgr().trigger_cp_flush(true /* force */);
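The quote cuts off at the flush; presumably the function then blocks on the returned future so the reset is durable before the caller moves on — a hedged completion sketch (the wait-and-assert is an assumption, not the exact source):

```cpp
    // Assumed continuation: wait for the forced CP flush to complete so the
    // cleared chunk ownership is persisted before the PG superblk is
    // destroyed (see the discussion that follows).
    auto success = std::move(fut).get();
    RELEASE_ASSERT(success, "Failed to flush CP for pg {} chunk reset", pg_id);
}
```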
Why trigger a CP flush only after resetting the chunks? Should we trigger the flush when all of the cleanup is done?
The other cleanups are metaService-only and do not rely on CP. I think the intent is to do the CP as early as possible, but I am fine with moving the CP anywhere before deleting the PG superblk.
Because only resetting the chunks relies on CP; the other operations are metaService destroys only.
The reason we need to do the CP before deleting the PG superblk is that if the CP came after deleting the PG superblk, and a crash happened before the CP ran, then during recovery we could not find the PG superblk anymore and would have no chance to reset the chunks again.
Given all of that, I thought doing the CP right after resetting the chunks was a good choice.
...
auto& pg = iter->second;
for (auto& shard : pg->shards_) {
    // release open shard v_chunk
NIT: remove this comment; we do not care about v_chunk here.
JacksonYao287 left a comment:
Generally looks good.
When a member is moved out of a PG, clean up the PG and reclaim resources:
Additionally, enhance restart tests with pristine state validation, covering PG, shard, and blob tests.