Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: lotus-provider: SDR Sealing pipeline #11534

Merged
merged 124 commits into from
Feb 22, 2024
Merged

feat: lotus-provider: SDR Sealing pipeline #11534

merged 124 commits into from
Feb 22, 2024

Conversation

magik6k
Copy link
Contributor

@magik6k magik6k commented Dec 17, 2023

Related Issues

Proposed Changes

Additional Info

This will be a longer-running branch for the SDR sealing pipeline in lotus-provider

TODO:

  • Write an initial top-level schema for tracing sealing progress
  • Implement a basic adder
    • lotus-provider command starting CC sectors
  • Implement the pipeline using filecoin-ffi
    • SDR
    • TreeD
    • TreeR / TreeC / Replica encoding (as a call to PC2)
    • SynthPoRep (??)
    • PreCommit send
    • PreCommit wait
    • Commit (PoRep snark)
    • SubmitCommit
  • Test 2k sealing CC
  • Test 2k sealing with data
  • Basic fail recovery system
  • Basic CLI for tracking the state of things

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@magik6k magik6k force-pushed the feat/lp-seal branch 2 times, most recently from 5c2bb1a to a4435d2 Compare December 18, 2023 20:43
provider/lpseal/task_sdr.go Outdated Show resolved Hide resolved
cmd/lotus-provider/pipeline.go Outdated Show resolved Hide resolved
cmd/lotus-provider/pipeline.go Show resolved Hide resolved
documentation/en/cli-lotus-provider.md Outdated Show resolved Hide resolved
documentation/en/cli-lotus-provider.md Outdated Show resolved Hide resolved
provider/lpffi/sdr_funcs.go Outdated Show resolved Hide resolved
max int
}

func NewPoRepTask(db *harmonydb.DB, api PoRepAPI, sp *SealPoller, sc *lpffi.SealCalls, maxPoRep int) *PoRepTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please comment what I need one of these for.

}

func (s *SubmitPrecommitTask) CanAccept(ids []harmonytask.TaskID, engine *harmonytask.TaskEngine) (*harmonytask.TaskID, error) {
id := ids[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These CanAccept()s are not picky enough to "enable everywhere and see what happens".
Can you add a bug+TODO on them?

It would be nice to default-enable most jobs in the future and have the system balance itself with CanAccept rather than hand-tuning.
We should be thinking about where on disk the various dependencies are if we want less data movement in a world with multiple sealing machines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So only tasks which care about data locality are:

  • SDR, which actually doesn't care about where it is started, because it has no large input data
  • SDRTrees, which needs all the data from SDR, so that one ideally would be smart about where it runs; At minimum we should make it try to run on the same node that SDR ran on in the first place if that is possible, but even that imo shouldn't be a blocker for this PR landing
  • Finalize, which really needs to run on the same machine that SDRTrees ran on (it already does that)
  • MoveStorage, which needs to run on a node with long-term storage, ideally "close" to the node that SDRTrees ran on

Tasks like SubmitPrecommit and others don't even touch the data files, so they don't really need to run on any specific node.

That said it would be really nice to have some standard way in CanAccept to express some priority with which the machine "wants" to run a task, ideally without any waiting being involved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
In discussing an order of work assigned, the order of the tasks put into harmonytask.New() is the order the poller assigns the work. Therefore having very-cheap tasks first is good, but having heavier tasks in a last-to-first order ensures that work later in the pipeline gets picked-up before work earlier in the pipeline.

cmd/lotus-provider/tasks/tasks.go Show resolved Hide resolved
max int
}

func NewTreesTask(sp *SealPoller, db *harmonydb.DB, sc *lpffi.SealCalls, maxTrees int) *TreesTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please comment what it does. Could these could link to a diagram?

provider/lpweb/hapi/routes.go Show resolved Hide resolved
@magik6k magik6k marked this pull request as ready for review February 20, 2024 16:07
@magik6k magik6k requested a review from a team as a code owner February 20, 2024 16:07
Copy link
Contributor

@snadrus snadrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few little things

cmd/lotus-provider/cli.go Outdated Show resolved Hide resolved

var configNewCmd = &cli.Command{
Name: "new-cluster",
Usage: "Create new coniguration for a new cluster",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do I use this? Where do I get SP actor addresses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the command to do that is lotus-shed miner create, I do want to create a separate tool for managing the miner actor, so then it will become sptool create

The commands that will live in sptool (vs curio / lotus-provider) would be the ones which interact with the miner actor, so things like create, managing balances, owner/control addresses, and other on-chain actions could be grouped in that binary.

(I just don't want to do all this in this PR, it's way too big already)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets refer to an open bug here with these goals in mind, that way it has a continue story for watchers.

}
}

if cctx.Bool("overwrite") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is a new cluster, how about we encourage "base"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion on this one, want to handle this one given that you're also handling the other onboarding paths? (lets us not block this PR)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm handling this elsewhere. Just add a few comments on what you're expectations are so I can do the right thing when I integrate this code with the other migration.

cmd/lotus-provider/storage.go Show resolved Hide resolved

var actorSetPeeridCmd = &cli.Command{
Name: "set-peer-id",
Usage: "set the peer id of your miner",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the peer ID used for? Should we mention that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is market node libp2p peer ID, which is where you send deal proposals and generally talk with the miner node.

Belongs in sptool is the future, when we build that tool (also imo this should just be automagically called by boost / market node on setup, but it's not..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is just a basic lotus-shed tool, those are really only meant for developer use, this one I just needed for devnet boost testing)

provider/lpseal/task_sdr.go Show resolved Hide resolved
provider/lpseal/task_trees.go Outdated Show resolved Hide resolved

// If 'active' is nil, initiate the HTTP request
if u.active == nil {
resp, err := http.Get(u.Url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have some kind of recovery here for over-the-Internet flakiness?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

range request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can swap this for a more robust http requester later. I have one in RIBS, and we can borrow at least some of it.

That said currently the lotus-shed boost proxy command won't support interrupted streams (that is a problem and it definitely needs to be fixed, but may require boost modifications), and currently there isn't a good way to start deals with data fetched straight from the internet.

I wouldn't block this PR on this for now, but it should be fixed together with improved boost<->curio adapter code.

provider/lpweb/hapi/robust_rpc.go Outdated Show resolved Hide resolved
id)
if err != nil {
//return xerrors.Errorf("updating storage health in DB fails with err: %v", err)
if harmonydb.IsErrSerialization(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this replaced by the serialization option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wrote this code before adding the option; Will change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is on db.Exec, not transaction; Should be done, but can be a follow up PR

@magik6k magik6k changed the title [wip] feat: lotus-provider: SDR Sealing pipeline feat: lotus-provider: SDR Sealing pipeline Feb 22, 2024
@magik6k magik6k merged commit cb67eeb into master Feb 22, 2024
88 checks passed
@magik6k magik6k deleted the feat/lp-seal branch February 22, 2024 16:10
@magik6k magik6k mentioned this pull request Feb 28, 2024
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants