This repository has been archived by the owner on Jun 6, 2023. It is now read-only.

Refactor miner WindowPoSt state into per-partition aggregates and queues #648

Merged (89 commits) on Jul 16, 2020

Conversation

@anorth (Member) commented Jul 13, 2020

For motivation, see #599.

This giant PR restructures the miner actor's state, representing partitions as first-class objects. Sectors, faults, recoveries, expirations and terminations are all tracked per-partition. A few totals of power and pledge are maintained in each partition so that power and penalty accounting for faults etc. need not load the SectorOnChainInfos for all the sectors (which can be a lot). The heavy per-sector information is only loaded in miner-initiated messages, never from cron.
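
For concreteness, a rough sketch of the per-partition aggregate described above. Field names and layout are indicative of the shape only, not the final definition:

```go
// Sketch only: an illustrative per-partition aggregate.
type Partition struct {
	Sectors    bitfield.BitField // All sector numbers assigned to this partition.
	Faults     bitfield.BitField // Subset of Sectors currently faulty.
	Recoveries bitfield.BitField // Subset of Faults declared recovered, awaiting proof.
	Terminated bitfield.BitField // Subset of Sectors terminated but not yet removed (masked for Window PoSt).

	// Combined fault/expiration queue: AMT[ChainEpoch]ExpirationSet.
	ExpirationsEpochs cid.Cid
	// Sectors terminated before their scheduled expiration, queued by epoch.
	EarlyTerminated cid.Cid // AMT[ChainEpoch]BitField

	// Cached aggregates, so cron never loads SectorOnChainInfos.
	LivePower       PowerPair
	FaultyPower     PowerPair
	RecoveringPower PowerPair
}
```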

Significant:

  • Partitions are now numbered per-deadline, rather than miner-wide
  • When a message references a sector, it usually needs to include information about which deadline and partition the sector is allocated to. The miner must do this lookup off-chain, avoiding the need for on-chain search or indices.
  • New sectors are allocated to deadlines and partitions immediately, rather than waiting for cron.
  • A cron callback now happens at the end of every deadline, rather than every proving period (Per-deadline miner cron #552)
  • Penalties are now paid for the proving period in arrears, rather than in advance (in general, penalties are reduced)
  • The fault and expiration queues are combined into one (per partition)
  • Terminated sectors are not removed from partitions until an explicit Defrag method; they are masked over for Window PoSt like faults

TODO before merging:

  • Track initial pledge requirement in expiration queue, release immediately upon on-time termination
  • Terminate deals when sectors terminate, pay termination fee, release pledge
  • Deny withdrawals when there are pending termination fees
  • Quantize the expiration queue (see the sketch after this list)
  • At WPoSt, charge for recovering sectors (in arrears) before recovering them
  • Resolve all new TODOs and XXX
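
On the quantization item above: a minimal sketch, assuming the design is "round each expiration epoch up to a fixed offset modulo some unit" so the queue holds a bounded number of distinct keys (the name and signature are illustrative):

```go
// quantizeUp rounds e up to the next epoch congruent to offset
// modulo unit.
func quantizeUp(e, unit, offset abi.ChainEpoch) abi.ChainEpoch {
	remainder := (e - offset) % unit
	if remainder == 0 {
		return e
	}
	if remainder < 0 {
		// Go's % takes the sign of the dividend.
		return e - remainder
	}
	return e - remainder + unit
}
```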

TODO follow-up:

Closes #391
Closes #357
Closes #411
Closes #418
Closes #483
Closes #519
Closes #535
Closes #552
Closes #593

@anorth changed the title from "Refactor miner WindowPoSt state into per-partition accounting units" to "Refactor miner WindowPoSt state into per-partition aggregates and queues" on Jul 13, 2020
@Stebalien (Member)

The heavy per-sector information is only loaded in miner-initiated messages, never from cron.

ConfirmSectorProofsValid still does a lot of sector info loading (called by the power actor's OnEpochTickEnd, which is called by cron). However, the amount of work done in ConfirmSectorProofsValid is proportional to the amount of work paid for by the miner in the same epoch.

@Stebalien (Member)

Deny withdrawals when there are pending termination fees

Do we want to forbid sealing sectors, winning blocks, etc. as well? Or just forbid withdrawal.

@Stebalien (Member)

Deal termination queue

It's looking like we're going to need to make this queue per-partition to deal with massive termination batches. If we do that, we're going to need:

  • A bitfield at the top level indicating which deadlines have partitions that need terminating (see the sketch below).
  • A bitfield at each deadline indicating which partitions have sectors that need terminating.

The other question here is how we're going to deal with early terminations. We don't want miners withdrawing funds while they haven't paid per-deal early termination penalties.

  1. We could (a) not include early terminations in this "normal termination" queue and (b) handle early deal termination in the early termination queue. This is the simplest approach but it means we have two different ways to handle deal termination.
  2. We could combine the early termination queue with this queue (two bitfields). However, because the normal termination queue is processed in the background by the chain, this might encourage miners to just let the chain deal with their early termination fees (saving on gas) instead of calling a method to deal with them up-front.

(obviously leaning towards 1 here)
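
A sketch of that two-level index, assuming a bitfield at each level (names are illustrative):

```go
// Sketch only. Top level: which deadlines contain partitions with
// sectors that need terminating.
type State struct {
	// ... existing miner state ...
	EarlyTerminations bitfield.BitField // deadline indices
}

// Within each deadline: which partitions those are.
type Deadline struct {
	// ... existing deadline state ...
	EarlyTerminations bitfield.BitField // partition indices
}
```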

@Stebalien (Member) commented Jul 13, 2020

Deal termination queue

The other question is whether or not this actually needs to be a queue. Unless we expect it to fall behind regularly, I'm not sure if it does.

edit: It needs to be a queue, but only for early terminations (as far as I can tell?).

@Stebalien (Member) commented Jul 13, 2020

Proposal:

  1. Force the miner to handle all fees/slashing/returning funds (sectors & deals) by calling a method to process early terminations. This way, we handle all fees in one place.
  2. Set a limit on how long a miner can have outstanding fees before we terminate the miner. This way, clients are guaranteed to recover locked up funds for slashed deals within a certain amount of time.
  3. Handle all state cleanup on defrag. Look ma, no cron!
  4. Return pledge on state cleanup to encourage miners to actually defrag.

Am I missing something here?

edit: Yes, there's a per-deal client collateral.

@anorth (Member, Author) commented Jul 14, 2020

@Stebalien and I talked about this.

  • Use the existing per-partition early termination queue, and process both deal terminations and termination fees when traversing it
  • Only necessary for early terminations; we don't need to tell the market actor about on-time terminations
  • In cron, process a limited amount of the queue and if not yet empty, schedule again for next epoch
  • Manual termination should enqueue, and then process a large amount of the queue immediately (hopefully emptying it)
    • Manual termination of zero sectors gives a manual queue-processing entry point if desired
  • Release pledge requirements as soon as we can; using it to incentivise defrags will result in an inefficiently high number of defrags

Defrag of a partition will have to wait for any early terminations to be processed.
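
A sketch of the cron-side processing this implies. PopEarlyTerminations, the batch limits, and the cron payload are assumed names for illustration, not the final API:

```go
// Sketch: process a bounded batch of early terminations from cron,
// rescheduling if the queue isn't yet drained.
func (a Actor) processEarlyTerminations(rt Runtime, st *State, store adt.Store) {
	result, hasMore, err := st.PopEarlyTerminations(store, maxPartitionsPerCron, maxSectorsPerCron)
	builtin.RequireNoErr(rt, err, exitcode.ErrIllegalState, "failed to pop early terminations")

	// For each (epoch, sectors) pair in result: charge the termination
	// fee, notify the market actor of terminated deals, release pledge.
	_ = result

	if hasMore {
		// Queue not empty: schedule another pass for the next epoch.
		a.enrollCronEvent(rt, rt.CurrEpoch()+1, &CronEventPayload{
			EventType: CronEventProcessEarlyTerminations,
		})
	}
}
```

Manual termination would enqueue and then run the same processing with a much larger batch limit.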

@Stebalien (Member)

So, I've hit a bit of a snag. We garbage collect deals as soon as they expire. If we process early terminations late, we may try to access missing deals when we compute fines. Can we keep dead deals around for a bit? (and possibly bound how late we can process early termination fees?).

@Stebalien (Member)

Answer: For manual terminations, always empty the queue before processing.

@anorth (Member, Author) commented Jul 14, 2020

Can we keep dead deals around for a bit?

No. For 14-day fault terminations we'll just deal with this occasionally coming a few epochs late if we have a large queue to work through.

@Stebalien (Member)

  1. How important is it to process early terminations in order? At the moment, I'm processing all early terminations deadline by deadline, partition by partition, epoch by epoch. That means I'll completely clear a partition before I move on to the next partition. I can process early terminations in order of termination epoch; it'll just require more state. Specifically, I'd need to turn the miner- and deadline-level "early termination" bitfields into queues.

  2. We're going to need a queue mapping epoch to target reward and network QA power at that epoch to compute termination fees. Unless we can query something for these.

I can fix 2 without fixing 1. But in that case, I won't be able to clear the queue from 2 until I've processed all outstanding sector expirations.

@anorth (Member, Author) commented Jul 15, 2020

I think out-of-order is ok for the simplicity gains of doing it by partition.

The state needed for termination fees must be written into the sector on-chain info. The BR(StartEpoch) can be computed from the initial pledge, and my impression was that this would suffice. We can store the BR explicitly if necessary.
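
A sketch of that derivation, assuming (as an illustration, not the settled policy) that the initial pledge was set to a fixed number of days of expected block reward at activation:

```go
// Hypothetical: pledge policy sets IP = initialPledgeFactor days of
// BR at activation, so BR(StartEpoch) = IP / initialPledgeFactor.
const initialPledgeFactor = 20 // illustrative value

func brAtActivation(initialPledge abi.TokenAmount) abi.TokenAmount {
	return big.Div(initialPledge, big.NewInt(initialPledgeFactor))
}
```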

@anorth (Member, Author) left a comment

Partial review


// - Skipped faults that are not in the provided partition trigger an error.
// - Skipped faults that are already declared (but not declared recovered) are ignored.
func processSkippedFaults(rt Runtime, st *State, store adt.Store, faultExpiration abi.ChainEpoch, partition *Partition,
@anorth (Member, Author)

Can/should we push this down to the state? We'd lose the IllegalArgument/IllegalState distinction, but then the state can maintain relations between the fields.

@codecov-commenter commented Jul 16, 2020

Codecov Report

Merging #648 into master will decrease coverage by 17.4%.
The diff coverage is 39.0%.

@@            Coverage Diff            @@
##           master    #648      +/-   ##
=========================================
- Coverage    68.4%   51.0%   -17.5%     
=========================================
  Files          44      50       +6     
  Lines        4787    5736     +949     
=========================================
- Hits         3279    2929     -350     
- Misses       1115    2436    +1321     
+ Partials      393     371      -22     

@anorth (Member, Author) commented Jul 16, 2020

I agree, checking more assumptions and preconditions in the partition would be worthwhile.

@Stebalien (Member) left a comment

We'll need to handle downtime but that might be a question for another PR.

builtin.RequireNoErr(rt, err, exitcode.ErrIllegalState, "failed to load proven sector info")

// Skip verification if all sectors are faults
// Skip verification if all sectors are faults.
// We still need to allow this call to succeed so the miner can declare a whole partition as skipped.
@Stebalien (Member)

Is there any reason to do this?


newSector := *sector
newSector.Expiration = decl.NewExpiration
//qaPowerDelta := big.Sub(QAPowerForSector(info.SectorSize, &newSector), QAPowerForSector(info.SectorSize, sector))
@Stebalien (Member)

commented out code

// That way, don't re-schedule a cron callback if one is already scheduled.
hadEarlyTerminations = havePendingEarlyTerminations(rt, &st)

// Note: because the cron actor is not invoked on epochs with empty tipsets, the current epoch is not necessarily
@Stebalien (Member)

What if we're multiple deadlines ahead? I'm guessing we should skip the missed deadlines and give the miner a pass, but we should probably do something.

// Increment current deadline, and proving period if necessary.
if dlInfo.PeriodStarted() {
    st.CurrentDeadline = (st.CurrentDeadline + 1) % WPoStPeriodDeadlines
    if st.CurrentDeadline == 0 {
        st.ProvingPeriodStart = st.ProvingPeriodStart + WPoStProvingPeriod
@Stebalien (Member)

Do we need to handle getting a full proving period behind?

// Set new proving period start.
if deadline.PeriodStarted() {
// Increment current deadline, and proving period if necessary.
if dlInfo.PeriodStarted() {
@Stebalien (Member)

This can't be false, can it?


for i := uint64(0); i < partitions.Length(); i++ {
    key := PartitionKey{dlInfo.Index, i}
    proven, err := deadline.PostSubmissions.IsSet(i)
@Stebalien (Member)

This is potentially slow. We should probably just expand this into a map (it's not too large). We can also do this with bitfield magic, but that's probably more complicated than it's worth.
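
A sketch of that expansion, reusing the names from the snippet above (the error-handling call is illustrative):

```go
// Walk the bitfield once and build a set, instead of calling IsSet
// for every partition index inside the loop.
proven := make(map[uint64]bool)
err := deadline.PostSubmissions.ForEach(func(idx uint64) error {
	proven[idx] = true
	return nil
})
builtin.RequireNoErr(rt, err, exitcode.ErrIllegalState, "failed to expand post submissions")
```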

// Accumulate sectors info for proof verification.
for _, post := range params.Partitions {
    key := PartitionKey{params.Deadline, post.Index}
    alreadyProven, err := deadline.PostSubmissions.IsSet(post.Index)
@Stebalien (Member)

I'd expand PostSubmissions into a map.

penaltyTarget := PledgePenaltyForUndeclaredFault(epochReward, pwrTotal.QualityAdjPower, penalizePowerTotal)
// Subtract the "ongoing" fault fee from the amount charged now, since it will be added on just below.
penaltyTarget = big.Sub(penaltyTarget, PledgePenaltyForDeclaredFault(epochReward, pwrTotal.QualityAdjPower, penalizePowerTotal))
penalty, err := st.UnlockUnvestedFunds(store, currEpoch, penaltyTarget)
@Stebalien (Member)

Can we do this once at the very end?

1. This needs to happen on compaction.
2. This can't happen here anyways.

It's kind of awkward to take slices of bitfields; this is something the caller should generally do.
@Stebalien (Member)

I think all the important bits have either been implemented or recorded in new issues. My only remaining concern is dealing with too many null blocks, but we can extract that into a new issue as well.
