Skip to content

disable restart for destroy-pending repl-dev#605

Merged
JacksonYao287 merged 3 commits intoeBay:masterfrom
JacksonYao287:donot-restart-destroy-pending-repl
Dec 17, 2024
Merged

disable restart for destroy-pending repl-dev#605
JacksonYao287 merged 3 commits intoeBay:masterfrom
JacksonYao287:donot-restart-destroy-pending-repl

Conversation

@JacksonYao287
Copy link
Contributor

@JacksonYao287 JacksonYao287 commented Dec 11, 2024

when we try to destroy a repl_dev , we will first mark it to destroy-pending state, and then a background gc thread will try to find it periodically and permanently destroy it. however, if crash happens before it is permanently destroyed, then there will be some issue left.

1 a destroy-pending repl-dev will not be put into m_rd_map , so raft_group_config_found will return a nullptr for this repl_dev, and thus repl_dev->restart will cause a nullpointer fault(fixed).

2 when permanently destroy a repl_dev we will

    m_rd_sb.destroy();
    m_raft_config_sb.destroy();
    m_data_journal->remove_store();
    logstore_service().destroy_log_dev(m_data_journal->logdev_id());

if crash happens after m_rd_sb.destroy(), but before m_data_journal->remove_store() , we will have no chance to reclaim log related resource for this repl_dev.

this pr checks and reclaims the resource when start and destory repl_dev superblk only after all the related resource are reclaimed

@JacksonYao287 JacksonYao287 marked this pull request as draft December 11, 2024 03:35
@codecov-commenter
Copy link

codecov-commenter commented Dec 11, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 10.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 66.51%. Comparing base (1a0cef8) to head (4933dfb).
Report is 103 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/logstore/log_dev.cpp 0.00% 2 Missing and 1 partial ⚠️
src/lib/logstore/log_store_service.cpp 0.00% 2 Missing and 1 partial ⚠️
src/lib/replication/service/raft_repl_service.cpp 0.00% 2 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #605       +/-   ##
===========================================
+ Coverage   56.51%   66.51%   +10.00%     
===========================================
  Files         108      109        +1     
  Lines       10300    10836      +536     
  Branches     1402     1484       +82     
===========================================
+ Hits         5821     7208     +1387     
+ Misses       3894     2921      -973     
- Partials      585      707      +122     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm in general.

Copy link
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also one thing i am not clear ,

void LogDev::handle_unopened_log_stores(bool format) {

this seems trying to similar target to GC the leaked logstores, it doesnt take care the logdev but it is not hard to add that when a logdev has zero open log store we GC the logdev?

I think better to have one clear solution, if we believe we have handled the GC here, the handle_unopened_log_stores should be change to ASSERT or LOG_ERROR

Copy link
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

feel free to merge if want to create a ticket and work on enhancement later

xiaoxichen
xiaoxichen previously approved these changes Dec 16, 2024
@JacksonYao287
Copy link
Contributor Author

JacksonYao287 commented Dec 16, 2024

I think better to have one clear solution

this is a good question. after go through the code, the logic is as follows.

1 destroy log store
for a particular log dev , a log store will be put into pending gc state(in m_garbage_store_ids) in two cases:
1 if not been opened before logdev starting
2 be removed(remove_log_store) at runtime.

when log truncation happens , if all the log entries of this log store is truncated, it will be permanently destroyed.

2 destroy logdev
it will release all the chunks of this logdev, and delete the metablk of it. and as a result , all the log store of this logdev is also detroyed , since all the logstore id of this logdev in the the metablk of this logdev.

coming to repl_dev case, if we do not open log dev, it will also be permanently destroyed.

// skip it.

// 3 logdev will be destroyed in delete_unopened_logdevs() if we don't open it(create repl_dev) here, so skip
// it.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need do nothing here, since if we do not create the repl_dev , the related log_dev will not be opened, as a result , the log dev will be eventually detroyed at

hs()->logstore_service().delete_unopened_logdevs();

@JacksonYao287 JacksonYao287 merged commit 6756b81 into eBay:master Dec 17, 2024
21 checks passed
@JacksonYao287 JacksonYao287 deleted the donot-restart-destroy-pending-repl branch December 17, 2024 00:19
hkadayam pushed a commit to hkadayam/HomeStore that referenced this pull request Aug 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants