
replicate createpg message across raft group #136

Merged
merged 10 commits on Feb 13, 2024

Conversation

JacksonYao287
Collaborator

We need to replicate the create-PG message across the raft group, so that the followers can create the PG info and the index table.

@JacksonYao287 JacksonYao287 self-assigned this Jan 25, 2024
@JacksonYao287 JacksonYao287 changed the title from "replicat createpg message across raft group" to "replicate createpg message across raft group" Jan 25, 2024
@codecov-commenter

codecov-commenter commented Jan 25, 2024

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (c3a1459) 72.43% compared to head (fba3006) 72.23%.

Files | Patch % | Lines
src/lib/homestore_backend/hs_pg_manager.cpp | 66.29% | 23 Missing and 7 partials ⚠️
src/include/homeobject/pg_manager.hpp | 0.00% | 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
- Coverage   72.43%   72.23%   -0.21%     
==========================================
  Files          30       30              
  Lines        1295     1390      +95     
  Branches      131      143      +12     
==========================================
+ Hits          938     1004      +66     
- Misses        278      300      +22     
- Partials       79       86       +7     


if (v.hasError()) { return folly::makeUnexpected(toPgError(v.error())); }
// We write a PGHeader to the raft group and act on it once the PGHeader is committed on the
// raft group, so that all raft members create the PG info and index table for this PG.
return do_create_pg(v.value(), std::move(pg_info));
Collaborator

What do we do if this fails? We need to handle that and either offload it somewhere to retry until it works; or unwind this group optimistically so we don't leak a repl_dev.

Collaborator Author

Good question!

I do not think we should retry: if we crash after create_repl_dev and before sending the create-PG message, then on restart the pg info is lost and the repl_dev does not know which PG it belongs to.

My solution is:
1. If do_create_pg fails, we just delete the repl_dev (unwind this group). create_repl_dev is not frequent and not in the IO path, so even if this is time-consuming, it is acceptable!
2. When restarting, after applying all the logs, get the repl_dev list from Homestore and iterate over all the PGs to find any repl_dev that belongs to no PG, then delete that repl_dev.

If we do this, we need to add two APIs to Homestore:
1. get_repl_dev_list(repl_dev_type t);
2. delete_repl_dev(uuid)

I think this makes sure we do not leak any repl_dev.
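To make this concrete, a rough sketch of the restart-time cleanup, assuming the two proposed Homestore APIs above; every name and signature here is illustrative, not existing code:

// Illustrative sketch only: after log replay, drop any repl_dev that no PG claims.
// get_repl_dev_list() / delete_repl_dev() are the proposed Homestore APIs above;
// pg_map_ is assumed to be the PG table rebuilt during recovery.
void HSHomeObject::cleanup_orphan_repl_devs() {
    for (auto const& rdev : home_store_->get_repl_dev_list(repl_dev_type::DATA /* hypothetical enum value */)) {
        bool claimed = false;
        for (auto const& [pg_id, pg] : pg_map_) {
            if (pg->pg_info_.replica_set_uuid == rdev->group_id()) {
                claimed = true;
                break;
            }
        }
        // No PG references this repl_dev: it leaked from a failed create_pg, so delete it.
        if (!claimed) { home_store_->delete_repl_dev(rdev->group_id()); }
    }
}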

cc @hkadayam

Contributor

It seems we put the group_id in the repl_dev superblock; can that be used to retrieve the pg info from HomeObject?

Collaborator Author

@xiaoxichen
The group_id is a randomly generated uuid, and the pg_id is a uint16. But yes, we can generate a uuid from the uint16 and recover the uint16 from the previously generated uuid, so that we get a natural mapping between the uuid and the pg_id.

The key point is that the pg info originally exists only in the create-PG message, which needs to be replicated across the repl_dev (raft group). If this message is not replicated to the majority of the group, it is not committed. On restart, the uncommitted message will probably be dropped, and the repl_dev cannot get the pg info from that dropped message.

Contributor

I am trying to say we could avoid replicating the pg info at all, and thereby avoid the 2PC in this PR's proposal. If we created the repl_dev we are all set: followers can construct the pg info from the repl_dev. During recovery the pg info can also be recovered from the superblock of the repl_dev / raft_config.

Of the 3 required fields, we already have two: members can be obtained from the raft cluster config, and the replica_set_uuid is the group_id. The only thing still needed is the pg_id.

As you said, we can

  1. encode pg_id into replica_set_uuid so we can get the pg_id back out of the replica_set_uuid
  2. put the pg_id into the user_ctx field of libnuraft::cluster_config, c.f. link
  3. put the entire serialized pg_info into user_ctx

Options 2/3 can be implemented since the cluster_config is loaded by calling RaftReplDev::load_config().

struct PGInfo {
    explicit PGInfo(pg_id_t _id) : id(_id) {}
    pg_id_t id;
    mutable MemberSet members;
    peer_id_t replica_set_uuid;

    auto operator<=>(PGInfo const& rhs) const { return id <=> rhs.id; }
    auto operator==(PGInfo const& rhs) const { return id == rhs.id; }
};
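
A minimal sketch of option 1, assuming boost UUIDs for the group_id / replica_set_uuid; the helper names are made up for illustration, and stamping raw bytes like this ignores the UUID version/variant bits:

#include <boost/uuid/uuid.hpp>
#include <boost/uuid/random_generator.hpp>
#include <cstdint>
#include <cstring>

using pg_id_t = uint16_t;

// Derive the replica_set_uuid from the pg_id at create time...
boost::uuids::uuid make_group_uuid(pg_id_t pg_id) {
    auto uuid = boost::uuids::random_generator()();
    std::memcpy(uuid.data + 14, &pg_id, sizeof(pg_id));   // stamp pg_id into the last 2 bytes
    return uuid;
}

// ...and recover the pg_id from the repl_dev's group_id during recovery.
pg_id_t pg_id_from_group_uuid(boost::uuids::uuid const& uuid) {
    pg_id_t pg_id{0};
    std::memcpy(&pg_id, uuid.data + 14, sizeof(pg_id));
    return pg_id;
}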

Collaborator Author

As a summary, there are two problems:
1. HS side: how to handle a create repl_dev failure (for example, failing to add any member to the raft group)? This is the responsibility of HS; we should not consider it on the HO side.
2. HO side: if we get a good repl_dev from HS but fail to create the PG (for example, the create-PG message fails to be replicated to the majority of the raft group and is not committed, or a crash happens at some point), how do we clean up to avoid leaking a repl_dev?

According to the meeting today, this will be done in phase three, and by then we will have new features in homestore and homeobject that will probably be very helpful for solving this problem. So how about we just leave the problem here for now (happy path only) and revisit it in phase 3?
@xiaoxichen @szmyd @sanebay

Contributor

Yes, let's create an issue and put a #FIXME in the code.

Collaborator Author

Sure, done.

Collaborator
@szmyd szmyd Jan 30, 2024

@xiaoxichen yes, I know; I brought this up initially. Actually, there is one problem that is the generalized form of all the others:

  • I (SM) belong to a raft group (repl_dev) that, according to the source of truth (CM), I am not managing a PG for.

All issues regarding repl_dev failure, pg_creation failure, migration during an offline member (orphaned pg)... they generalize to this. The existence of a PG under that raft group is inconsequential. The solution is what I suggested; anything else is an optimization for a specialization of this issue and need not take the risk of complicating the machinery just to correct something that is not immediately necessary to correct.

This comes from years of fighting raft membership issues with AM; we do try to clean up in the cases we can, but we don't worry ourselves about it, since the global cleanup routine will correct things. Many attempts to optimize resulted in out-of-quorum groups and deadlocks in the code.

We should just do what's easy here to start, so things are correct, and then assess whether there are timing issues that warrant an optimization beyond fire-and-forget cleanup.

Contributor

@szmyd yes, what you said makes perfect sense. Raft membership changes in SM will be less frequent than in AM, but reconciling based on CM's PGMap is still the right path. There are some timing issues we need to be careful about, though.

@@ -7,6 +7,10 @@ void ReplicationStateMachine::on_commit(int64_t lsn, const sisl::blob& header, c
LOGI("applying raft log commit with lsn:{}", lsn);
const ReplicationMessageHeader* msg_header = r_cast< const ReplicationMessageHeader* >(header.cbytes());
switch (msg_header->msg_type) {
case ReplicationMessageType::CREATE_PG_MSG: {
Contributor

Could you please add tests?

Collaborator Author

I think the existing UT already covers this case:

EXPECT_TRUE(homeobj_->pg_manager()->create_pg(std::move(info)).get());

Do we need to add a more specific test case?
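If we do want something more targeted, a rough sketch of the kind of test that would exercise the replicated/recovery path; the fixture name and helpers here are hypothetical and would need to match the real test harness:

// Hypothetical sketch: verify that a PG created through the leader is visible again
// after a restart, i.e. it was recreated from the committed CREATE_PG_MSG.
TEST_F(HomeObjectFixture, ReplicatedCreatePG) {
    pg_id_t pg_id{1};
    create_pg(pg_id);                 // hypothetical helper: builds PGInfo and calls create_pg()
    restart();                        // hypothetical helper: tears down and replays the logs
    EXPECT_TRUE(pg_exists(pg_id));    // PG info and index table should be rebuilt on replay
}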

repl_dev->async_read(blkids, value, value.size)
.thenValue([this, lsn, msg_header = *header_ptr, blkids, value](auto&& err) mutable {
if (err) {
LOGW("failed to read data from homestore pba, lsn:{}", lsn);
Contributor

Should we set the error in ctx?

Collaborator Author

No, if we get here it is a log re-application during recovery, so hs_ctx is null and we cannot get the ctx.

Collaborator

This is the code I referred to earlier: we don't need async_read() if we write as much information as needed in the header. This is true for the other replicated use cases (except shard create and blob put).
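
To illustrate the point, a sketch of a header that carries everything the commit path needs; the field layout is illustrative, not the actual ReplicationMessageHeader:

#include <cstdint>

// Illustrative layout only: if the header itself carries the pg identity and the
// payload checksum, on_commit() can create the PG info and index table without
// issuing an async_read() for the data blocks.
#pragma pack(push, 1)
struct CreatePGMsgHeader {
    uint16_t msg_type;              // ReplicationMessageType::CREATE_PG_MSG
    uint16_t pg_id;                 // which PG to create
    uint8_t  replica_set_uuid[16];  // the repl_dev / raft group this PG maps to
    uint32_t payload_size;          // size of the serialized member list that follows
    uint32_t payload_crc;           // integrity check of that payload
};
#pragma pack(pop)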

Contributor

@hkadayam yes, what you said makes sense.
Does it make sense to let small IO (< page size, or even < 16K, etc.) go directly through raft? Most of the metadata ops like createpg/createshard could go through this path, for example:

// If the value is small (<= MIN_DATACHANNEL_VALUESIZE, a proposed threshold),
// skip the data channel and propose the entry directly through raft.
if (rreq->value.size > MIN_DATACHANNEL_VALUESIZE) {
    // go through the data channel
} else {
    // propose through raft
    RD_LOG(INFO, "Skipping data channel send since value size is below the threshold");
    rreq->state.fetch_or(uint32_cast(repl_req_state_t::DATA_WRITTEN));
    m_state_machine->propose_to_raft(std::move(rreq));
}

@JacksonYao287
Collaborator Author

JacksonYao287 commented Jan 27, 2024

I have manually tested this PR with three replicas. It works as expected!

@szmyd
All the tests pass in my local env; it seems the CI failure is not caused by this PR itself. Could you help take a look?

@JacksonYao287
Collaborator Author

This PR can be built successfully after https://github.com/eBay/HomeStore/pull/284/files is merged.

@JacksonYao287 JacksonYao287 added this to the MileStone4.1 milestone Feb 1, 2024
@hkadayam hkadayam merged commit 135d76e into eBay:main Feb 13, 2024
20 of 24 checks passed
@JacksonYao287 JacksonYao287 deleted the replcate-createPG-message branch February 15, 2024 02:50