feat: re-implementation of sp exit #1279

Merged
merged 62 commits into from Dec 26, 2023

Conversation

@constwz constwz commented Dec 18, 2023

Description

This PR aims at implementing the SP exit data recover process and required CMDs.

This SP exit process is outlined below:

  1. The exiting SP sends MsgStorageProviderExit to Greenfield, and its status becomes STATUS_GRACEFUL_EXITING.
  2. Any other SP interested in succeeding the exiting SP, either as primary SP of one of its Global Virtual Group Families (VGF) or as secondary SP of one of its Global Virtual Groups (GVG), can send a tx with MsgReserveSwapIn to reserve the position.
  3. The successor SP recovers data from the VGF/GVG that the exiting SP belongs to. If the exiting SP is the primary SP of the VGF, the recovery happens between the successor and the secondary SPs of the VGF. If the exiting SP is a secondary SP, the recovery happens between the successor and the primary SP of the GVG.
  4. Once the successor has all required data, it sends a tx with MsgCompleteSwapIn to acknowledge success, and the SP replacement takes place within the VGF/GVG.
  5. Once the exiting SP has no more VGF/GVG associated with it, anyone can send a tx with MsgCompleteStorageProviderExit to complete that SP's exit.
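The rule in step 3 for choosing recovery peers can be sketched as follows. This is a minimal illustration, not code from the PR: the `GVG` struct and `RecoverySources` function are hypothetical, and the model is simplified to a single GVG (a VGF contains several GVGs, each with its own secondaries).

```go
package main

import "fmt"

// GVG is a hypothetical, simplified view of a global virtual group.
type GVG struct {
	ID           uint32
	PrimarySP    uint32
	SecondarySPs []uint32
}

// RecoverySources returns the SP IDs a successor would recover data from
// when it replaces exitingSP in gvg, following step 3 of the exit process.
func RecoverySources(gvg GVG, exitingSP uint32) ([]uint32, error) {
	if gvg.PrimarySP == exitingSP {
		// The exiting SP is the primary: recover from the secondaries.
		return append([]uint32(nil), gvg.SecondarySPs...), nil
	}
	for _, id := range gvg.SecondarySPs {
		if id == exitingSP {
			// The exiting SP is a secondary: recover from the primary.
			return []uint32{gvg.PrimarySP}, nil
		}
	}
	return nil, fmt.Errorf("sp %d is not a member of gvg %d", exitingSP, gvg.ID)
}

func main() {
	gvg := GVG{ID: 1, PrimarySP: 10, SecondarySPs: []uint32{11, 12, 13}}

	src, _ := RecoverySources(gvg, 10)
	fmt.Println(src) // [11 12 13]

	src, _ = RecoverySources(gvg, 12)
	fmt.Println(src) // [10]
}
```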

For more details, refer to bnb-chain/BEPs#338

Specifically, this PR mainly focuses on steps 3 and 4, which allow a successor SP to recover data so that the SP exit can complete.
Implementation:

  1. RecoverGVGScheduler: one scheduler is started per GVG. It repeatedly fetches batches of objects through the meta API ListObjectsInGVG with the params StartAfter and Limit, and pushes each object's recover-piece tasks to recoverQueue. After it has iterated over all objects in the GVG, it marks the GVG status as processed, regardless of whether every object was recovered or some failed.
  2. RecoverFailedObjectScheduler: a scheduler dedicated to retrying objects that failed to recover. The failures come from RecoverGVGScheduler and VerifyGVGScheduler.
  3. VerifyGVGScheduler: verifies whether every object has actually been recovered; any object that has not is picked up by RecoverFailedObjectScheduler. Once all objects are found recovered, it automatically sends a MsgCompleteSwapIn tx to the chain and then stops all schedulers. If some objects cannot be recovered and exceed the retry limit, the scheduler also stops, and the user needs to query the failed objects with the CMDs listed below and then either discontinue or retry them.
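The paging loop described for RecoverGVGScheduler might look roughly like the sketch below. It is illustrative only: the `ObjectLister` interface is an assumption modelled on the ListObjectsInGVG name and its StartAfter/Limit params, and `Object`, `RecoverGVG`, `memLister`, and `errPieceUnavailable` are hypothetical names that do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Object is a hypothetical stand-in for an object returned by the meta API.
type Object struct{ ID uint64 }

// ObjectLister is an assumed interface modelled on ListObjectsInGVG;
// the real signature in the SP codebase may differ.
type ObjectLister interface {
	ListObjectsInGVG(gvgID uint32, startAfter uint64, limit uint32) ([]Object, error)
}

var errPieceUnavailable = errors.New("piece unavailable")

// RecoverGVG pages through every object in the GVG, hands each one to
// enqueue, and returns the IDs that could not be enqueued. The caller can
// mark the GVG as processed afterwards whether or not failures occurred,
// leaving the failed IDs to a separate retry scheduler.
func RecoverGVG(l ObjectLister, gvgID, limit uint32, enqueue func(Object) error) ([]uint64, error) {
	var failed []uint64
	var startAfter uint64
	for {
		batch, err := l.ListObjectsInGVG(gvgID, startAfter, limit)
		if err != nil {
			return failed, err
		}
		if len(batch) == 0 {
			return failed, nil // iterated over all objects
		}
		for _, obj := range batch {
			if err := enqueue(obj); err != nil {
				failed = append(failed, obj.ID)
			}
		}
		// Resume after the last object seen in this batch.
		startAfter = batch[len(batch)-1].ID
	}
}

// memLister is an in-memory lister used only for this demonstration.
// Its ids must be sorted ascending and greater than zero.
type memLister struct{ ids []uint64 }

func (m memLister) ListObjectsInGVG(_ uint32, startAfter uint64, limit uint32) ([]Object, error) {
	var out []Object
	for _, id := range m.ids {
		if id > startAfter && uint32(len(out)) < limit {
			out = append(out, Object{ID: id})
		}
	}
	return out, nil
}

func main() {
	l := memLister{ids: []uint64{1, 2, 3, 4, 5}}
	failed, err := RecoverGVG(l, 7, 2, func(o Object) error {
		if o.ID == 3 {
			return errPieceUnavailable
		}
		return nil
	})
	fmt.Println(failed, err) // [3] <nil>
}
```

Collecting failures instead of aborting mirrors the description above: the GVG pass always runs to the end, and retries are handled elsewhere.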

Sp exit CMD
exit sp

  1. ./gnfd-sp spExit --config ./config.toml
  2. ./gnfd-sp completeSpExit --config ./config.toml

successor sp CMD

  1. send swapIn tx ./gnfd-sp swapIn --config ./config.toml -f <vgf id> --gid <gvg id> -sp <target sp id>
  2. recover resource
    ./gnfd-sp recover-gvg --config ./config.toml --gid <gvg id>
    ./gnfd-sp recover-vgf --config ./config.toml -f <vgf id>
  3. query progress ./gnfd-sp query-recover-p --config ./config.toml -f <vgf id> --gid <gvg id>
  4. complete swap ./gnfd-sp completeSwapIn --config ./config.toml -f <vgf id> --gid <gvg id>

exit tool CMD

  1. ListGlobalVirtualGroupsBySecondarySP
    ./gnfd-sp query-gvg-by-sp --config ./config.toml -sp <sp id>
  2. ListVirtualGroupFamiliesBySpIDAction
    ./gnfd-sp query-vgf-by-sp --config ./config.toml -sp <sp id>

Rationale

The process for sp exit needs to be simpler and easier to maintain

Example

To exit an SP, the operator only needs to run two commands: spExit and completeSpExit.

Changes

Notable changes:

  • Added cmd related to sp exit
  • Modified the object recovery logic

Potential Impacts

  • recover object

@@ -131,6 +131,16 @@ func init() {
command.SetQuotaCmd,
// block syncer
bs_data_migration.BsDataMigrationCmd,
// sp exit v2
Contributor

do we officially call it "v2"?

}

func SpExitAction(ctx *cli.Context) error {
cfg, err := utils.MakeConfig(ctx)
Contributor

This action looks very critical.

It would be nice to have a "re-confirm" mechanism (e.g. print a warning and ask the SP operator to enter the SP address to confirm).

BTW, can this operation be canceled?

@constwz
Contributor Author

constwz commented Dec 20, 2023

This PR will be a significant reference for SP owners and testers, so can we add detailed examples in the Example section for:

  1. How can an SP exit?
  2. How can a successor SP take over the data?
  3. How does the exiting SP or successor SP know the progress of the exit?
  4. After the sp-exit process fully completes, what is the impact on end users?
  5. Any other details that can help both SP owners and testers with the actual SP-EXIT operations

The operation method has been added

@constwz
Contributor Author

constwz commented Dec 20, 2023

please fix failure checks

The commit message can be resolved using a squash merge, and other checks have passed

@@ -156,12 +156,32 @@ func (gc *GCWorker) checkGVGMatchSP(ctx context.Context, objectInfo *storagetype

if redundancyIndex == piecestore.PrimarySPRedundancyIndex {
if gvg.GetPrimarySpId() != spID {
swapInInfo, err := gc.e.baseApp.Consensus().QuerySwapInInfo(ctx, gvg.FamilyId, virtualgrouptypes.NoSpecifiedGVGId)
Contributor

Can we add a check in isAllowGCCheck so that, if the SP is in STATUS_GRACEFUL_EXITING, we do not run GC for this SP?

Contributor

The exiting SP can still do the GC; it does not matter.

if !ok {
return true
}
return len(stats.SucceedSegments)+len(stats.FailedSegments) == stats.SegmentCount && len(stats.FailedSegments) > 0
Contributor

Isn't it an error if either of these two conditions holds?
len(stats.SucceedSegments)+len(stats.FailedSegments) != stats.SegmentCount || len(stats.FailedSegments) > 0

Contributor

len(stats.SucceedSegments)+len(stats.FailedSegments) != stats.SegmentCount would also be true when only 1 piece has responded so far.
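The difference between the two predicates discussed in this thread can be seen in a small sketch. `SegmentStats` is a hypothetical, simplified struct whose field names are borrowed from the diff; the real field types may differ, and both function names are invented for illustration.

```go
package main

import "fmt"

// SegmentStats is a hypothetical, simplified form of the per-object stats
// referenced in the diff; the slice element type is an assumption.
type SegmentStats struct {
	SegmentCount    int
	SucceedSegments []uint32
	FailedSegments  []uint32
}

// IsRecoverFailed mirrors the merged condition: report failure only after
// every segment has responded and at least one of them failed.
func IsRecoverFailed(s SegmentStats) bool {
	responded := len(s.SucceedSegments) + len(s.FailedSegments)
	return responded == s.SegmentCount && len(s.FailedSegments) > 0
}

// isRecoverFailedLoose is the alternative raised in review. It also fires
// while segments are still outstanding, so in-progress objects look failed.
func isRecoverFailedLoose(s SegmentStats) bool {
	responded := len(s.SucceedSegments) + len(s.FailedSegments)
	return responded != s.SegmentCount || len(s.FailedSegments) > 0
}

func main() {
	// Only 1 of 4 segments has responded so far, and it succeeded.
	inProgress := SegmentStats{SegmentCount: 4, SucceedSegments: []uint32{0}}
	fmt.Println(IsRecoverFailed(inProgress), isRecoverFailedLoose(inProgress)) // false true

	// All 4 segments have responded and one of them failed.
	done := SegmentStats{
		SegmentCount:    4,
		SucceedSegments: []uint32{0, 1, 2},
		FailedSegments:  []uint32{3},
	}
	fmt.Println(IsRecoverFailed(done), isRecoverFailedLoose(done)) // true true
}
```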

@constwz constwz merged commit 1818b8b into develop Dec 26, 2023
11 of 12 checks passed
@constwz constwz deleted the adapt-sp-exit branch December 26, 2023 02:52

7 participants