This repository has been archived by the owner on Jun 6, 2023. It is now read-only.

Refactor miner WindowPoSt state into per-partition aggregates and queues #648

Merged (89 commits) on Jul 16, 2020

Conversation

@anorth (Member) commented Jul 13, 2020

For motivation, see #599.

This giant PR restructures the miner actor's state, representing partitions as first-class objects. Sectors, faults, recoveries, expirations and terminations are all tracked per-partition. A few totals of power and pledge are maintained in each partition so that power and penalty accounting for faults etc. need not load the SectorOnChainInfos for all the sectors (which can be a lot). The heavy per-sector information is only loaded in miner-initiated messages, never from cron.
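
For concreteness, a rough sketch of the per-partition aggregate described above. Field names and layout are indicative of the shape only, not the final definition:

```go
// Sketch only: an illustrative per-partition aggregate.
type Partition struct {
	Sectors    bitfield.BitField // All sector numbers assigned to this partition.
	Faults     bitfield.BitField // Subset of Sectors currently faulty.
	Recoveries bitfield.BitField // Subset of Faults declared recovered, awaiting proof.
	Terminated bitfield.BitField // Subset of Sectors terminated but not yet removed (masked for Window PoSt).

	// Combined fault/expiration queue: AMT[ChainEpoch]ExpirationSet.
	ExpirationsEpochs cid.Cid
	// Sectors terminated before their scheduled expiration, queued by epoch.
	EarlyTerminated cid.Cid // AMT[ChainEpoch]BitField

	// Cached aggregates, so cron never loads SectorOnChainInfos.
	LivePower       PowerPair
	FaultyPower     PowerPair
	RecoveringPower PowerPair
}
```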

Significant:

  • Partitions are now numbered per-deadline, rather than miner-wide
  • When a message references a sector, it usually needs to include information about which deadline and partition the sector is allocated to. The miner must do this lookup off-chain, avoiding the need for on-chain search or indices.
  • New sectors are allocated to deadlines and partitions immediately, rather than waiting for cron.
  • A cron callback now happens at the end of every deadline, rather than every proving period (Per-deadline miner cron #552)
  • Penalties are now paid for the proving period in arrears, rather than in advance (in general, penalties are reduced)
  • The fault and expiration queues are combined into one (per partition)
  • Terminated sectors are not removed from partitions until an explicit Defrag method; they are masked over for Window PoSt like faults

TODO before merging:

  • Track initial pledge requirement in expiration queue, release immediately upon on-time termination
  • Terminate deals when sectors terminate, pay termination fee, release pledge
  • Deny withdrawals when there are pending termination fees
  • Quantize the expiration queue (see the sketch after this list)
  • At WPoSt, charge for recovering sectors (in arrears) before recovering them
  • Resolve all new TODOs and XXX
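
On the quantization item above: a minimal sketch, assuming the design is "round each expiration epoch up to a fixed offset modulo some unit" so the queue holds a bounded number of distinct keys (the name and signature are illustrative):

```go
// quantizeUp rounds e up to the next epoch congruent to offset
// modulo unit.
func quantizeUp(e, unit, offset abi.ChainEpoch) abi.ChainEpoch {
	remainder := (e - offset) % unit
	if remainder == 0 {
		return e
	}
	if remainder < 0 {
		// Go's % takes the sign of the dividend.
		return e - remainder
	}
	return e - remainder + unit
}
```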

TODO follow-up:

Closes #391
Closes #357
Closes #411
Closes #418
Closes #483
Closes #519
Closes #535
Closes #552
Closes #593

@anorth changed the title from "Refactor miner WindowPoSt state into per-partition accounting units" to "Refactor miner WindowPoSt state into per-partition aggregates and queues" on Jul 13, 2020
@Stebalien (Member)

The heavy per-sector information is only loaded in miner-initiated messages, never from cron.

ConfirmSectorProofsValid still does a lot of sector info loading (called by the power actor's OnEpochTickEnd, which is called by cron). However, the amount of work done in ConfirmSectorProofsValid is proportional to the amount of work paid for by the miner in the same epoch.

@Stebalien (Member)

Deny withdrawals when there are pending termination fees

Do we want to forbid sealing sectors, winning blocks, etc. as well? Or just forbid withdrawal.

@Stebalien (Member)

Deal termination queue

It's looking like we're going to need to make this queue per-partition to deal with massive termination batches. If we do that, we're going to need:

  • A bitfield at the top level indicating which deadlines have partitions that need terminating (see the sketch below).
  • A bitfield at each deadline indicating which partitions have sectors that need terminating.

The other question here is how we're going to deal with early terminations. We don't want miners withdrawing funds while they haven't paid per-deal early termination penalties.

  1. We could (a) not include early terminations in this "normal termination" queue and (b) handle early deal termination in the early termination queue. This is the simplest approach but it means we have two different ways to handle deal termination.
  2. We could combine the early termination queue with this queue (two bitfields). However, because the normal termination queue is processed in the background by the chain, this might encourage miners to just let the chain deal with their early termination fees (saving on gas) instead of calling a method to deal with them up-front.

(obviously leaning towards 1 here)
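
A sketch of that two-level index, assuming a bitfield at each level (names are illustrative):

```go
// Sketch only. Top level: which deadlines contain partitions with
// sectors that need terminating.
type State struct {
	// ... existing miner state ...
	EarlyTerminations bitfield.BitField // deadline indices
}

// Within each deadline: which partitions those are.
type Deadline struct {
	// ... existing deadline state ...
	EarlyTerminations bitfield.BitField // partition indices
}
```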

@Stebalien (Member) commented Jul 13, 2020

Deal termination queue

The other question is whether or not this actually needs to be a queue. Unless we expect it to fall behind regularly, I'm not sure if it does.

edit: It needs to be a queue, but only for early terminations (as far as I can tell?).

@Stebalien (Member) commented Jul 13, 2020

Proposal:

  1. Force the miner to handle all fees/slashing/returning funds (sectors & deals) by calling a method to process early terminations. This way, we handle all fees in one place.
  2. Set a limit on how long a miner can have outstanding fees before we terminate the miner. This way, clients are guaranteed to recover locked up funds for slashed deals within a certain amount of time.
  3. Handle all state cleanup on defrag. Look ma, no cron!
  4. Return pledge on state cleanup to encourage miners to actually defrag.

Am I missing something here?

edit: Yes, there's a per-deal client collateral.

@anorth (Member, Author) commented Jul 14, 2020

@Stebalien and I talked about this.

  • Use the existing per-partition early termination queue, and process both deal terminations and termination fees when traversing it
  • Only necessary for early terminations; we don't need to tell the market actor about on-time terminations
  • In cron, process a limited amount of the queue and if not yet empty, schedule again for next epoch
  • Manual termination should enqueue, and then process a large amount of the queue immediately (hopefully emptying it)
    • Manual termination of zero sectors gives a manual queue-processing entry point if desired
  • Release pledge requirements as soon as we can; using it to incentivise defrags will result in an inefficiently high number of defrags

Defrag of a partition will have to wait for any early terminations to be processed.
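
A sketch of the cron-side processing this implies. PopEarlyTerminations, the batch limits, and the cron payload are assumed names for illustration, not the final API:

```go
// Sketch: process a bounded batch of early terminations from cron,
// rescheduling if the queue isn't yet drained.
func (a Actor) processEarlyTerminations(rt Runtime, st *State, store adt.Store) {
	result, hasMore, err := st.PopEarlyTerminations(store, maxPartitionsPerCron, maxSectorsPerCron)
	builtin.RequireNoErr(rt, err, exitcode.ErrIllegalState, "failed to pop early terminations")

	// For each (epoch, sectors) pair in result: charge the termination
	// fee, notify the market actor of terminated deals, release pledge.
	_ = result

	if hasMore {
		// Queue not empty: schedule another pass for the next epoch.
		a.enrollCronEvent(rt, rt.CurrEpoch()+1, &CronEventPayload{
			EventType: CronEventProcessEarlyTerminations,
		})
	}
}
```

Manual termination would enqueue and then run the same processing with a much larger batch limit.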

@Stebalien (Member)

So, I've hit a bit of a snag. We garbage collect deals as soon as they expire. If we process early terminations late, we may try to access missing deals when we compute fines. Can we keep dead deals around for a bit? (and possibly bound how late we can process early termination fees?).

@Stebalien (Member)

Answer: For manual terminations, always empty the queue before processing.

@anorth (Member, Author) commented Jul 14, 2020

Can we keep dead deals around for a bit?

No. For 14-day fault terminations we'll just deal with this occasionally coming a few epochs late if we have a large queue to work through.

@Stebalien (Member)

  1. How important is it to process early terminations in order? At the moment, I'm processing all early terminations deadline by deadline, partition by partition, epoch by epoch. That means I'll completely clear a partition before I move on to the next partition. I can process early terminations in order of termination epoch; it'll just require more state. Specifically, I'd need to turn the miner- and deadline-level "early termination" bitfields into queues.

  2. We're going to need a queue mapping epoch to target reward and network QA power at that epoch to compute termination fees. Unless we can query something for these.

I can fix 2 without fixing 1. But in that case, I won't be able to clear the queue from 2 until I've processed all outstanding sector expirations.

@anorth (Member, Author) commented Jul 15, 2020

I think out-of-order is ok for the simplicity gains of doing it by partition.

The state needed for termination fees must be written into the sector on-chain info. The BR(StartEpoch) can be computed from the initial pledge, and my impression was that this would suffice. We can store the BR explicitly if necessary.
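
A sketch of that derivation, assuming (as an illustration, not the settled policy) that the initial pledge was set to a fixed number of days of expected block reward at activation:

```go
// Hypothetical: pledge policy sets IP = initialPledgeFactor days of
// BR at activation, so BR(StartEpoch) = IP / initialPledgeFactor.
const initialPledgeFactor = 20 // illustrative value

func brAtActivation(initialPledge abi.TokenAmount) abi.TokenAmount {
	return big.Div(initialPledge, big.NewInt(initialPledgeFactor))
}
```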

@anorth (Member, Author) left a comment

Partial review


// - Skipped faults that are not in the provided partition trigger an error.
// - Skipped faults that are already declared (but not declared recovered) are ignored.
func processSkippedFaults(rt Runtime, st *State, store adt.Store, faultExpiration abi.ChainEpoch, partition *Partition,
@anorth (Member, Author)

Can/should we push this down to the state? We'd lose the IllegalArgument/IllegalState distinction, but then the state can maintain relations between the fields.

@codecov-commenter commented Jul 16, 2020

Codecov Report

Merging #648 into master will decrease coverage by 17.4%.
The diff coverage is 39.0%.

@@            Coverage Diff            @@
##           master    #648      +/-   ##
=========================================
- Coverage    68.4%   51.0%   -17.5%     
=========================================
  Files          44      50       +6     
  Lines        4787    5736     +949     
=========================================
- Hits         3279    2929     -350     
- Misses       1115    2436    +1321     
+ Partials      393     371      -22     

@anorth (Member, Author) commented Jul 16, 2020

I agree, checking more assumptions and preconditions in the partition would be worthwhile.

@Stebalien (Member) left a comment

We'll need to handle downtime but that might be a question for another PR.

builtin.RequireNoErr(rt, err, exitcode.ErrIllegalState, "failed to load proven sector info")

// Skip verification if all sectors are faults
// Skip verification if all sectors are faults.
// We still need to allow this call to succeed so the miner can declare a whole partition as skipped.
@Stebalien (Member)

Is there any reason to do this?


newSector := *sector
newSector.Expiration = decl.NewExpiration
//qaPowerDelta := big.Sub(QAPowerForSector(info.SectorSize, &newSector), QAPowerForSector(info.SectorSize, sector))
@Stebalien (Member)

commented out code

// That way, don't re-schedule a cron callback if one is already scheduled.
hadEarlyTerminations = havePendingEarlyTerminations(rt, &st)

// Note: because the cron actor is not invoked on epochs with empty tipsets, the current epoch is not necessarily
@Stebalien (Member)

What if we're multiple deadlines ahead? I'm guessing we should skip the missed deadlines and give the miner a pass, but we should probably do something.

// Increment current deadline, and proving period if necessary.
if dlInfo.PeriodStarted() {
    st.CurrentDeadline = (st.CurrentDeadline + 1) % WPoStPeriodDeadlines
    if st.CurrentDeadline == 0 {
        st.ProvingPeriodStart = st.ProvingPeriodStart + WPoStProvingPeriod
@Stebalien (Member)

Do we need to handle getting a full proving period behind?

// Set new proving period start.
if deadline.PeriodStarted() {
// Increment current deadline, and proving period if necessary.
if dlInfo.PeriodStarted() {
@Stebalien (Member)

This can't be false, can it?


for i := uint64(0); i < partitions.Length(); i++ {
    key := PartitionKey{dlInfo.Index, i}
    proven, err := deadline.PostSubmissions.IsSet(i)
@Stebalien (Member)

This is potentially slow. We should probably just expand this into a map (it's not too large). We can also do this with bitfield magic, but that's probably more complicated than it's worth.
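
A sketch of that expansion, reusing the names from the snippet above (the error-handling call is illustrative):

```go
// Walk the bitfield once and build a set, instead of calling IsSet
// for every partition index inside the loop.
proven := make(map[uint64]bool)
err := deadline.PostSubmissions.ForEach(func(idx uint64) error {
	proven[idx] = true
	return nil
})
builtin.RequireNoErr(rt, err, exitcode.ErrIllegalState, "failed to expand post submissions")
```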

// Accumulate sectors info for proof verification.
for _, post := range params.Partitions {
    key := PartitionKey{params.Deadline, post.Index}
    alreadyProven, err := deadline.PostSubmissions.IsSet(post.Index)
@Stebalien (Member)

I'd expand PostSubmissions into a map.

penaltyTarget := PledgePenaltyForUndeclaredFault(epochReward, pwrTotal.QualityAdjPower, penalizePowerTotal)
// Subtract the "ongoing" fault fee from the amount charged now, since it will be added on just below.
penaltyTarget = big.Sub(penaltyTarget, PledgePenaltyForDeclaredFault(epochReward, pwrTotal.QualityAdjPower, penalizePowerTotal))
penalty, err := st.UnlockUnvestedFunds(store, currEpoch, penaltyTarget)
@Stebalien (Member)

Can we do this once at the very end?

1. This needs to happen on compaction.
2. This can't happen here anyways.

It's kind of awkward to take slices of bitfields; this is something the caller should generally do.
@Stebalien (Member)

I think all the important bits have either been implemented or recorded in new issues. My only remaining concern is dealing with too many null blocks, but we can extract that into a new issue as well.
